📥 Module 2

Data Collection & Ingestion

⏱ 10 hoursBeginner–Intermediate6 topics
🎯 By the end: load data from CSV/Excel files, query a SQL database from Python, scrape a web page responsibly, pull live data from a REST API with authentication, and wrap it all in a small, repeatable ingestion pipeline.

Before you can analyse anything, you have to get the data in. In the real world it is scattered everywhere — a spreadsheet from finance, a database the engineers own, a web page with no download button, a live API that updates every minute. This module gives you a reliable way to pull from every one of those sources into Python.

1Reading files with Pandas: CSV, Excel & flat files

In Module 1 you read a CSV the manual way to understand it. From here on we use Pandas — one line loads a file into a DataFrame, a powerful in-memory table. (We only load here; cleaning comes in Module 3.)

import pandas as pd

# CSV — the analyst's everyday format
sales = pd.read_csv('sales.csv')

# Excel — choose a specific sheet
budget = pd.read_excel('budget.xlsx', sheet_name='Q1')

print(sales.head())
print('Shape:', sales.shape)
▶ Output
         date region  amount
0  2024-01-01  North    1200
1  2024-01-01  South     980
2  2024-01-02  North    1450
Shape: (3, 3)

Read smarter with options

read_csv has dozens of options. These four solve 90% of messy imports:

orders = pd.read_csv(
    'orders.csv',
    usecols=['date', 'region', 'amount'],   # only the columns you need
    parse_dates=['date'],                    # read dates as real dates
    na_values=['', 'N/A', 'NA', '-'],        # treat these as missing
    thousands=',',                           # 1,200 -> 1200
)
print(orders.dtypes)
▶ Output
date      datetime64[ns]
region            object
amount             int64
dtype: object
Excel needs a helper: read_excel requires the openpyxl package — install it once with pip install openpyxl. For tab-separated files use pd.read_csv('file.tsv', sep='\t').
Key points
  • pd.read_csv() and pd.read_excel() load a file into a DataFrame in one line.
  • Use usecols, parse_dates, na_values and thousands to import cleanly the first time.
  • df.head(), df.shape and df.dtypes are your first three checks on any new dataset.

2Pulling data from SQL databases

Most company data lives in a relational database. You connect with a connection URL, run a query, and Pandas hands you a DataFrame. The same code works for SQLite, MySQL and PostgreSQL — only the URL changes.

DatabaseSQLAlchemy connection URL
SQLite (a local file)sqlite:///company.db
MySQLmysql+pymysql://user:pass@host/db
PostgreSQLpostgresql+psycopg2://user:pass@host/db
import pandas as pd
from sqlalchemy import create_engine

# SQLite needs no server — perfect for learning
engine = create_engine('sqlite:///company.db')

query = '''
    SELECT region, SUM(amount) AS total
    FROM sales
    GROUP BY region
    ORDER BY total DESC
'''

df = pd.read_sql(query, engine)
print(df)
▶ Output
  region   total
0  North  2650.0
1  South   980.0

You push the heavy lifting (filtering, grouping) to the database and pull back only the summary — fast and scalable. You will write serious SQL in Module 7.

Never hard-code credentials. Read them from environment variables so passwords never end up in your code or on GitHub:

import os
url = ('postgresql+psycopg2://' + os.environ['DB_USER'] + ':'
       + os.environ['DB_PASS'] + '@db.example.com/analytics')
engine = create_engine(url)
Key points
  • A connection URL + create_engine() connects Python to any SQL database.
  • pd.read_sql(query, engine) returns query results straight into a DataFrame.
  • Let the database do the filtering/grouping; keep secrets in environment variables, not in code.

3Web scraping responsibly with Requests & BeautifulSoup

When there is no download and no API, you can read the data straight from a web page. Requests fetches the HTML; BeautifulSoup lets you pluck out the bits you want.

import requests
from bs4 import BeautifulSoup

resp = requests.get('https://example.com/products', timeout=10)
resp.raise_for_status()                  # stop if the page failed to load

soup = BeautifulSoup(resp.text, 'html.parser')

rows = []
for card in soup.select('.product'):     # a CSS selector
    rows.append({
        'name': card.select_one('.name').text.strip(),
        'price': card.select_one('.price').text.strip(),
    })

print(rows[:2])
▶ Output
[{'name': 'Mechanical Keyboard', 'price': '799'},
 {'name': 'Wireless Mouse', 'price': '349'}]

You build a list of dictionaries (one per record), which Pandas turns into a table instantly: pd.DataFrame(rows).

Scrape ethically and legally. Always: read the site's robots.txt and Terms of Service; scrape gently (add a delay, do not hammer the server); identify yourself with a User-Agent; never collect personal data. If an API exists, use it instead — it is faster, allowed and more stable.
Key points
  • requests.get(url) downloads a page; BeautifulSoup parses it so you can select elements.
  • Collect records as a list of dicts, then pd.DataFrame(rows) for a table.
  • Scraping carries legal/ethical duties — respect robots.txt, rate-limit, and prefer an official API.

4Consuming REST APIs & parsing JSON

An API is a clean, official doorway to live data. You send an HTTP request; the server replies with JSON. This is the most reliable way to collect fresh data.

Your Python scriptrequests.get(...)API serverreturns JSONGET request + params/keyJSON response (data)
A request goes out; structured JSON data comes back.
import requests
import pandas as pd

url = 'https://api.exchangerate.host/latest'
resp = requests.get(url, params={'base': 'USD', 'symbols': 'INR,EUR,GBP'}, timeout=10)
resp.raise_for_status()

data = resp.json()           # JSON -> Python dict
print(resp.status_code)
print(data['rates'])

# flatten nested JSON straight into a table
table = pd.json_normalize(data)
print(table[['base', 'rates.INR', 'rates.EUR']])
▶ Output
200
{'INR': 83.2, 'EUR': 0.92, 'GBP': 0.79}
  base  rates.INR  rates.EUR
0  USD       83.2       0.92
Production habits: always check resp.status_code (200 = OK) or call resp.raise_for_status(); respect the API's rate limits; and handle pagination (many APIs return data in pages — loop until there are no more).
Key points
  • APIs return JSON; resp.json() turns it into a Python dict/list.
  • pd.json_normalize() flattens nested JSON into a tidy DataFrame.
  • Always check status codes, respect rate limits, and follow pagination to get all the data.

5Authentication: API keys & OAuth tokens

Most useful APIs need to know who is calling. The two common schemes are a simple API key and an OAuth bearer token.

MethodHow it is sentTypical use
API key (param)in the query string: ?apikey=...simple public-data APIs
API key (header)X-API-Key: ... headermost SaaS data APIs
OAuth 2.0 tokenAuthorization: Bearer ...user-account & enterprise APIs
import os
import requests

# Load secrets from the environment — NEVER write them in code
API_KEY = os.environ['WEATHER_API_KEY']

# 1) key as a query parameter
r1 = requests.get('https://api.weather.com/v1/now',
                  params={'city': 'Mumbai', 'apikey': API_KEY}, timeout=10)

# 2) OAuth bearer token in the Authorization header
headers = {'Authorization': 'Bearer ' + os.environ['ACCESS_TOKEN']}
r2 = requests.get('https://api.service.com/v2/data', headers=headers, timeout=10)

print(r1.status_code, r2.status_code)
▶ Output
200 200
Protect your keys. Keep them in a .env file, load them with the python-dotenv package, and add .env to .gitignore so they are never committed. A leaked key can cost you money or get your access revoked.
Key points
  • API key in a query param or header for simple APIs; OAuth Bearer token for user/enterprise APIs.
  • Send credentials with params= or headers= in requests.
  • Store keys in environment variables / a .gitignore-d .env file — never in code or Git.

6Building a repeatable ingestion pipeline

A one-off pull is fine for exploring. For real work you want a small pipeline you can re-run any time and trust to produce the same clean result.

1Source2Extract3Validate4Store raw5Schedule
Each pull follows the same five steps — easy to automate and audit.
import requests, csv
from datetime import date

def fetch_rates(base='USD'):
    url = 'https://api.exchangerate.host/latest'
    resp = requests.get(url, params={'base': base}, timeout=10)
    resp.raise_for_status()              # fail loudly on errors
    return resp.json()['rates']

def save_snapshot(rates):
    stamp = date.today().isoformat()      # e.g. 2024-01-15
    path = 'rates_' + stamp + '.csv'      # timestamped = never overwrite history
    with open(path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['currency', 'rate'])
        for currency, rate in rates.items():
            writer.writerow([currency, rate])
    return path

saved = save_snapshot(fetch_rates('USD'))
print('Saved:', saved)
▶ Output
Saved: rates_2024-01-15.csv
Keep raw data immutable. Save each pull with a timestamp and never edit it — do your cleaning on a copy. That way you can always reproduce an old analysis. To run this automatically, schedule the script with cron (Linux/Mac) or Task Scheduler (Windows).
Key points
  • Split ingestion into clear functions: extract → validate → store.
  • Use timeout= and raise_for_status() so failures are loud, not silent.
  • Store raw pulls with a timestamp and keep them immutable; schedule with cron / Task Scheduler.

★ Hands-on Project — Live Data Collector

Build a small program that collects live data from a public API and stores it for analysis — a real ingestion pipeline you can re-run any time.

  1. Pick a free public API that needs no login to start, e.g. exchangerate.host (currencies), Open-Meteo (weather) or CoinGecko (crypto prices).
  2. Write collect.py with a fetch() function that calls the API with requests.get(...), a timeout, and resp.raise_for_status().
  3. Parse the JSON with resp.json() and pull out the fields you care about.
  4. Save each run to a timestamped CSV (e.g. data_2024-01-15.csv) so history is never overwritten.
  5. Add basic validation: skip the save and print a warning if the response is empty or missing expected keys.
  6. If the API needs a key, load it from an environment variable (or a .env file) — do not hard-code it.
  7. Load your saved file back with pd.read_csv(...) and print df.head() to confirm it is analysis-ready.
  8. Bonus: schedule collect.py to run daily with cron or Windows Task Scheduler and commit the project to GitHub.

Ready to test yourself?

Take the module quiz. Score 70% or more to mark this module complete.

Start the quiz →

💡 Log in to save your progress and earn the certificate.