📥 Module 2

Data Collection & Ingestion

⏱ 10 hoursBeginner–Intermediate6 topics

🎯 By the end: load data from CSV/Excel files, query a SQL database from Python, scrape a web page responsibly, pull live data from a REST API with authentication, and wrap it all in a small, repeatable ingestion pipeline.

Before you can analyse anything, you have to get the data in. In the real world it is scattered everywhere — a spreadsheet from finance, a database the engineers own, a web page with no download button, a live API that updates every minute. This module gives you a reliable way to pull from every one of those sources into Python.

1Reading files with Pandas: CSV, Excel & flat files

In Module 1 you read a CSV the manual way to understand it. From here on we use Pandas — one line loads a file into a DataFrame, a powerful in-memory table. (We only load here; cleaning comes in Module 3.)

import pandas as pd

# CSV — the analyst's everyday format
sales = pd.read_csv('sales.csv')

# Excel — choose a specific sheet
budget = pd.read_excel('budget.xlsx', sheet_name='Q1')

print(sales.head())
print('Shape:', sales.shape)

▶ Output

         date region  amount
0  2024-01-01  North    1200
1  2024-01-01  South     980
2  2024-01-02  North    1450
Shape: (3, 3)

Read smarter with options

read_csv has dozens of options. These four solve 90% of messy imports:

orders = pd.read_csv(
    'orders.csv',
    usecols=['date', 'region', 'amount'],   # only the columns you need
    parse_dates=['date'],                    # read dates as real dates
    na_values=['', 'N/A', 'NA', '-'],        # treat these as missing
    thousands=',',                           # 1,200 -> 1200
)
print(orders.dtypes)

▶ Output

date      datetime64[ns]
region            object
amount             int64
dtype: object

Excel needs a helper: read_excel requires the openpyxl package — install it once with pip install openpyxl. For tab-separated files use pd.read_csv('file.tsv', sep='\t').

Key points

pd.read_csv() and pd.read_excel() load a file into a DataFrame in one line.
Use usecols, parse_dates, na_values and thousands to import cleanly the first time.
df.head(), df.shape and df.dtypes are your first three checks on any new dataset.

2Pulling data from SQL databases

Most company data lives in a relational database. You connect with a connection URL, run a query, and Pandas hands you a DataFrame. The same code works for SQLite, MySQL and PostgreSQL — only the URL changes.

Database	SQLAlchemy connection URL
SQLite (a local file)	`sqlite:///company.db`
MySQL	`mysql+pymysql://user:pass@host/db`
PostgreSQL	`postgresql+psycopg2://user:pass@host/db`

import pandas as pd
from sqlalchemy import create_engine

# SQLite needs no server — perfect for learning
engine = create_engine('sqlite:///company.db')

query = '''
    SELECT region, SUM(amount) AS total
    FROM sales
    GROUP BY region
    ORDER BY total DESC
'''

df = pd.read_sql(query, engine)
print(df)

▶ Output

  region   total
0  North  2650.0
1  South   980.0

You push the heavy lifting (filtering, grouping) to the database and pull back only the summary — fast and scalable. You will write serious SQL in Module 7.

Never hard-code credentials. Read them from environment variables so passwords never end up in your code or on GitHub:

import os
url = ('postgresql+psycopg2://' + os.environ['DB_USER'] + ':'
       + os.environ['DB_PASS'] + '@db.example.com/analytics')
engine = create_engine(url)

Key points

A connection URL + create_engine() connects Python to any SQL database.
pd.read_sql(query, engine) returns query results straight into a DataFrame.
Let the database do the filtering/grouping; keep secrets in environment variables, not in code.

3Web scraping responsibly with Requests & BeautifulSoup

When there is no download and no API, you can read the data straight from a web page. Requests fetches the HTML; BeautifulSoup lets you pluck out the bits you want.

import requests
from bs4 import BeautifulSoup

resp = requests.get('https://example.com/products', timeout=10)
resp.raise_for_status()                  # stop if the page failed to load

soup = BeautifulSoup(resp.text, 'html.parser')

rows = []
for card in soup.select('.product'):     # a CSS selector
    rows.append({
        'name': card.select_one('.name').text.strip(),
        'price': card.select_one('.price').text.strip(),
    })

print(rows[:2])

▶ Output

[{'name': 'Mechanical Keyboard', 'price': '799'},
 {'name': 'Wireless Mouse', 'price': '349'}]

You build a list of dictionaries (one per record), which Pandas turns into a table instantly: pd.DataFrame(rows).

Scrape ethically and legally. Always: read the site's robots.txt and Terms of Service; scrape gently (add a delay, do not hammer the server); identify yourself with a User-Agent; never collect personal data. If an API exists, use it instead — it is faster, allowed and more stable.

Key points

requests.get(url) downloads a page; BeautifulSoup parses it so you can select elements.
Collect records as a list of dicts, then pd.DataFrame(rows) for a table.
Scraping carries legal/ethical duties — respect robots.txt, rate-limit, and prefer an official API.

4Consuming REST APIs & parsing JSON

An API is a clean, official doorway to live data. You send an HTTP request; the server replies with JSON. This is the most reliable way to collect fresh data.

A request goes out; structured JSON data comes back.

import requests
import pandas as pd

url = 'https://api.exchangerate.host/latest'
resp = requests.get(url, params={'base': 'USD', 'symbols': 'INR,EUR,GBP'}, timeout=10)
resp.raise_for_status()

data = resp.json()           # JSON -> Python dict
print(resp.status_code)
print(data['rates'])

# flatten nested JSON straight into a table
table = pd.json_normalize(data)
print(table[['base', 'rates.INR', 'rates.EUR']])

▶ Output

200
{'INR': 83.2, 'EUR': 0.92, 'GBP': 0.79}
  base  rates.INR  rates.EUR
0  USD       83.2       0.92

Production habits: always check resp.status_code (200 = OK) or call resp.raise_for_status(); respect the API's rate limits; and handle pagination (many APIs return data in pages — loop until there are no more).

Key points

APIs return JSON; resp.json() turns it into a Python dict/list.
pd.json_normalize() flattens nested JSON into a tidy DataFrame.
Always check status codes, respect rate limits, and follow pagination to get all the data.

5Authentication: API keys & OAuth tokens

Most useful APIs need to know who is calling. The two common schemes are a simple API key and an OAuth bearer token.

Method	How it is sent	Typical use
API key (param)	in the query string: `?apikey=...`	simple public-data APIs
API key (header)	`X-API-Key: ...` header	most SaaS data APIs
OAuth 2.0 token	`Authorization: Bearer ...`	user-account & enterprise APIs

import os
import requests

# Load secrets from the environment — NEVER write them in code
API_KEY = os.environ['WEATHER_API_KEY']

# 1) key as a query parameter
r1 = requests.get('https://api.weather.com/v1/now',
                  params={'city': 'Mumbai', 'apikey': API_KEY}, timeout=10)

# 2) OAuth bearer token in the Authorization header
headers = {'Authorization': 'Bearer ' + os.environ['ACCESS_TOKEN']}
r2 = requests.get('https://api.service.com/v2/data', headers=headers, timeout=10)

print(r1.status_code, r2.status_code)

▶ Output

200 200

Protect your keys. Keep them in a .env file, load them with the python-dotenv package, and add .env to .gitignore so they are never committed. A leaked key can cost you money or get your access revoked.

Key points

API key in a query param or header for simple APIs; OAuth Bearer token for user/enterprise APIs.
Send credentials with params= or headers= in requests.
Store keys in environment variables / a .gitignore-d .env file — never in code or Git.

6Building a repeatable ingestion pipeline

A one-off pull is fine for exploring. For real work you want a small pipeline you can re-run any time and trust to produce the same clean result.

Each pull follows the same five steps — easy to automate and audit.

import requests, csv
from datetime import date

def fetch_rates(base='USD'):
    url = 'https://api.exchangerate.host/latest'
    resp = requests.get(url, params={'base': base}, timeout=10)
    resp.raise_for_status()              # fail loudly on errors
    return resp.json()['rates']

def save_snapshot(rates):
    stamp = date.today().isoformat()      # e.g. 2024-01-15
    path = 'rates_' + stamp + '.csv'      # timestamped = never overwrite history
    with open(path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['currency', 'rate'])
        for currency, rate in rates.items():
            writer.writerow([currency, rate])
    return path

saved = save_snapshot(fetch_rates('USD'))
print('Saved:', saved)

▶ Output

Saved: rates_2024-01-15.csv

Keep raw data immutable. Save each pull with a timestamp and never edit it — do your cleaning on a copy. That way you can always reproduce an old analysis. To run this automatically, schedule the script with cron (Linux/Mac) or Task Scheduler (Windows).

Key points

Split ingestion into clear functions: extract → validate → store.
Use timeout= and raise_for_status() so failures are loud, not silent.
Store raw pulls with a timestamp and keep them immutable; schedule with cron / Task Scheduler.

★ Hands-on Project — Live Data Collector

Build a small program that collects live data from a public API and stores it for analysis — a real ingestion pipeline you can re-run any time.

Pick a free public API that needs no login to start, e.g. exchangerate.host (currencies), Open-Meteo (weather) or CoinGecko (crypto prices).
Write collect.py with a fetch() function that calls the API with requests.get(...), a timeout, and resp.raise_for_status().
Parse the JSON with resp.json() and pull out the fields you care about.
Save each run to a timestamped CSV (e.g. data_2024-01-15.csv) so history is never overwritten.
Add basic validation: skip the save and print a warning if the response is empty or missing expected keys.
If the API needs a key, load it from an environment variable (or a .env file) — do not hard-code it.
Load your saved file back with pd.read_csv(...) and print df.head() to confirm it is analysis-ready.
Bonus: schedule collect.py to run daily with cron or Windows Task Scheduler and commit the project to GitHub.

Ready to test yourself?

Take the module quiz. Score 70% or more to mark this module complete.

Start the quiz →

💡 Log in to save your progress and earn the certificate.

← Previous

Python Foundations for Analytics

Data Wrangling with Pandas