Before you can analyse anything, you have to get the data in. In the real world it is scattered everywhere — a spreadsheet from finance, a database the engineers own, a web page with no download button, a live API that updates every minute. This module gives you a reliable way to pull from every one of those sources into Python.
1Reading files with Pandas: CSV, Excel & flat files
In Module 1 you read a CSV the manual way to understand it. From here on we use Pandas — one line loads a file into a DataFrame, a powerful in-memory table. (We only load here; cleaning comes in Module 3.)
import pandas as pd
# CSV — the analyst's everyday format
sales = pd.read_csv('sales.csv')
# Excel — choose a specific sheet
budget = pd.read_excel('budget.xlsx', sheet_name='Q1')
print(sales.head())
print('Shape:', sales.shape)date region amount 0 2024-01-01 North 1200 1 2024-01-01 South 980 2 2024-01-02 North 1450 Shape: (3, 3)
Read smarter with options
read_csv has dozens of options. These four solve 90% of messy imports:
orders = pd.read_csv(
'orders.csv',
usecols=['date', 'region', 'amount'], # only the columns you need
parse_dates=['date'], # read dates as real dates
na_values=['', 'N/A', 'NA', '-'], # treat these as missing
thousands=',', # 1,200 -> 1200
)
print(orders.dtypes)date datetime64[ns] region object amount int64 dtype: object
read_excel requires the openpyxl package — install it once with pip install openpyxl. For tab-separated files use pd.read_csv('file.tsv', sep='\t').pd.read_csv()andpd.read_excel()load a file into a DataFrame in one line.- Use
usecols,parse_dates,na_valuesandthousandsto import cleanly the first time. df.head(),df.shapeanddf.dtypesare your first three checks on any new dataset.
2Pulling data from SQL databases
Most company data lives in a relational database. You connect with a connection URL, run a query, and Pandas hands you a DataFrame. The same code works for SQLite, MySQL and PostgreSQL — only the URL changes.
| Database | SQLAlchemy connection URL |
|---|---|
| SQLite (a local file) | sqlite:///company.db |
| MySQL | mysql+pymysql://user:pass@host/db |
| PostgreSQL | postgresql+psycopg2://user:pass@host/db |
import pandas as pd
from sqlalchemy import create_engine
# SQLite needs no server — perfect for learning
engine = create_engine('sqlite:///company.db')
query = '''
SELECT region, SUM(amount) AS total
FROM sales
GROUP BY region
ORDER BY total DESC
'''
df = pd.read_sql(query, engine)
print(df)region total 0 North 2650.0 1 South 980.0
You push the heavy lifting (filtering, grouping) to the database and pull back only the summary — fast and scalable. You will write serious SQL in Module 7.
import os
url = ('postgresql+psycopg2://' + os.environ['DB_USER'] + ':'
+ os.environ['DB_PASS'] + '@db.example.com/analytics')
engine = create_engine(url)- A connection URL +
create_engine()connects Python to any SQL database. pd.read_sql(query, engine)returns query results straight into a DataFrame.- Let the database do the filtering/grouping; keep secrets in environment variables, not in code.
3Web scraping responsibly with Requests & BeautifulSoup
When there is no download and no API, you can read the data straight from a web page. Requests fetches the HTML; BeautifulSoup lets you pluck out the bits you want.
import requests
from bs4 import BeautifulSoup
resp = requests.get('https://example.com/products', timeout=10)
resp.raise_for_status() # stop if the page failed to load
soup = BeautifulSoup(resp.text, 'html.parser')
rows = []
for card in soup.select('.product'): # a CSS selector
rows.append({
'name': card.select_one('.name').text.strip(),
'price': card.select_one('.price').text.strip(),
})
print(rows[:2])[{'name': 'Mechanical Keyboard', 'price': '799'},
{'name': 'Wireless Mouse', 'price': '349'}]You build a list of dictionaries (one per record), which Pandas turns into a table instantly: pd.DataFrame(rows).
robots.txt and Terms of Service; scrape gently (add a delay, do not hammer the server); identify yourself with a User-Agent; never collect personal data. If an API exists, use it instead — it is faster, allowed and more stable.requests.get(url)downloads a page;BeautifulSoupparses it so you can select elements.- Collect records as a list of dicts, then
pd.DataFrame(rows)for a table. - Scraping carries legal/ethical duties — respect robots.txt, rate-limit, and prefer an official API.
4Consuming REST APIs & parsing JSON
An API is a clean, official doorway to live data. You send an HTTP request; the server replies with JSON. This is the most reliable way to collect fresh data.
import requests
import pandas as pd
url = 'https://api.exchangerate.host/latest'
resp = requests.get(url, params={'base': 'USD', 'symbols': 'INR,EUR,GBP'}, timeout=10)
resp.raise_for_status()
data = resp.json() # JSON -> Python dict
print(resp.status_code)
print(data['rates'])
# flatten nested JSON straight into a table
table = pd.json_normalize(data)
print(table[['base', 'rates.INR', 'rates.EUR']])200
{'INR': 83.2, 'EUR': 0.92, 'GBP': 0.79}
base rates.INR rates.EUR
0 USD 83.2 0.92resp.status_code (200 = OK) or call resp.raise_for_status(); respect the API's rate limits; and handle pagination (many APIs return data in pages — loop until there are no more).- APIs return JSON;
resp.json()turns it into a Python dict/list. pd.json_normalize()flattens nested JSON into a tidy DataFrame.- Always check status codes, respect rate limits, and follow pagination to get all the data.
5Authentication: API keys & OAuth tokens
Most useful APIs need to know who is calling. The two common schemes are a simple API key and an OAuth bearer token.
| Method | How it is sent | Typical use |
|---|---|---|
| API key (param) | in the query string: ?apikey=... | simple public-data APIs |
| API key (header) | X-API-Key: ... header | most SaaS data APIs |
| OAuth 2.0 token | Authorization: Bearer ... | user-account & enterprise APIs |
import os
import requests
# Load secrets from the environment — NEVER write them in code
API_KEY = os.environ['WEATHER_API_KEY']
# 1) key as a query parameter
r1 = requests.get('https://api.weather.com/v1/now',
params={'city': 'Mumbai', 'apikey': API_KEY}, timeout=10)
# 2) OAuth bearer token in the Authorization header
headers = {'Authorization': 'Bearer ' + os.environ['ACCESS_TOKEN']}
r2 = requests.get('https://api.service.com/v2/data', headers=headers, timeout=10)
print(r1.status_code, r2.status_code)200 200
.env file, load them with the python-dotenv package, and add .env to .gitignore so they are never committed. A leaked key can cost you money or get your access revoked.- API key in a query param or header for simple APIs; OAuth
Bearertoken for user/enterprise APIs. - Send credentials with
params=orheaders=inrequests. - Store keys in environment variables / a
.gitignore-d.envfile — never in code or Git.
6Building a repeatable ingestion pipeline
A one-off pull is fine for exploring. For real work you want a small pipeline you can re-run any time and trust to produce the same clean result.
import requests, csv
from datetime import date
def fetch_rates(base='USD'):
url = 'https://api.exchangerate.host/latest'
resp = requests.get(url, params={'base': base}, timeout=10)
resp.raise_for_status() # fail loudly on errors
return resp.json()['rates']
def save_snapshot(rates):
stamp = date.today().isoformat() # e.g. 2024-01-15
path = 'rates_' + stamp + '.csv' # timestamped = never overwrite history
with open(path, 'w', newline='') as f:
writer = csv.writer(f)
writer.writerow(['currency', 'rate'])
for currency, rate in rates.items():
writer.writerow([currency, rate])
return path
saved = save_snapshot(fetch_rates('USD'))
print('Saved:', saved)Saved: rates_2024-01-15.csv
- Split ingestion into clear functions: extract → validate → store.
- Use
timeout=andraise_for_status()so failures are loud, not silent. - Store raw pulls with a timestamp and keep them immutable; schedule with cron / Task Scheduler.
★ Hands-on Project — Live Data Collector
Build a small program that collects live data from a public API and stores it for analysis — a real ingestion pipeline you can re-run any time.
- Pick a free public API that needs no login to start, e.g. exchangerate.host (currencies), Open-Meteo (weather) or CoinGecko (crypto prices).
- Write
collect.pywith afetch()function that calls the API withrequests.get(...), atimeout, andresp.raise_for_status(). - Parse the JSON with
resp.json()and pull out the fields you care about. - Save each run to a timestamped CSV (e.g.
data_2024-01-15.csv) so history is never overwritten. - Add basic validation: skip the save and print a warning if the response is empty or missing expected keys.
- If the API needs a key, load it from an environment variable (or a
.envfile) — do not hard-code it. - Load your saved file back with
pd.read_csv(...)and printdf.head()to confirm it is analysis-ready. - Bonus: schedule
collect.pyto run daily with cron or Windows Task Scheduler and commit the project to GitHub.
Ready to test yourself?
Take the module quiz. Score 70% or more to mark this module complete.
Start the quiz →💡 Log in to save your progress and earn the certificate.