🗄️ Module 7

SQL for Analysts

⏱ 12 hoursIntermediate6 topics
🎯 By the end: write SELECT queries with filtering, grouping and sorting; join multiple tables; structure complex logic with subqueries and CTEs; rank and run totals with window functions; and integrate SQL into a Python analytics workflow.

SQL is the most-requested skill in data-analyst job ads worldwide — because the data lives in databases, and SQL is how you ask it questions. You already pulled query results into Pandas in Module 2; now you learn to write the queries like a professional.

1The core query: SELECT, WHERE, GROUP BY, HAVING, ORDER BY

Every analytics query is built from the same clauses, and they always run in a fixed logical order.

SELECT   region,
         SUM(amount) AS total_sales,
         COUNT(*)    AS orders
FROM     sales
WHERE    order_date >= '2024-01-01'      -- filter rows first
GROUP BY region                          -- collapse into groups
HAVING   SUM(amount) > 3000              -- filter the groups
ORDER BY total_sales DESC;               -- sort the result
regiontotal_salesorders
North53004
East41003
West38003
WHERE vs HAVING: WHERE filters individual rows before grouping; HAVING filters the groups after aggregation. Use WHERE for raw columns, HAVING for aggregates like SUM().
Key points
  • SELECT chooses columns, WHERE filters rows, GROUP BY aggregates, HAVING filters groups, ORDER BY sorts.
  • Use WHERE for raw values and HAVING for aggregate conditions (e.g. SUM() > 3000).
  • Alias columns with AS for readable, report-ready output.

2Joining tables

Analytics needs data from many tables. A JOIN combines rows from two tables on a shared key.

SELECT   o.order_id,
         c.name,
         c.city,
         o.amount
FROM     orders   AS o
LEFT JOIN customers AS c
       ON o.customer_id = c.customer_id
ORDER BY o.amount DESC;
order_idnamecityamount
1003MayaPune1450
1001AshaMumbai1200
1002RohanDelhi980
JoinKeeps
INNER JOINonly rows that match in both tables
LEFT JOINall rows from the left table + matches
RIGHT JOINall rows from the right table + matches
FULL OUTER JOINall rows from both tables
CROSS JOINevery combination (use with care)
Alias your tables (orders AS o) and qualify columns (o.amount) so the query stays unambiguous and readable when joining several tables.
Key points
  • JOIN combines tables on a shared key; choose the type by which unmatched rows you want to keep.
  • INNER = matches only; LEFT = everything from the left table plus matches.
  • Alias tables and qualify column names to keep multi-table queries clear.

3Subqueries & Common Table Expressions (CTEs)

For multi-step logic, build the query in readable layers. A CTE (the WITH clause) is a named, temporary result you can reference — far cleaner than nested subqueries.

-- Which customers spend above the overall average?
WITH customer_spend AS (
    SELECT customer_id, SUM(amount) AS total
    FROM   orders
    GROUP BY customer_id
)
SELECT cs.customer_id, cs.total
FROM   customer_spend AS cs
WHERE  cs.total > (SELECT AVG(total) FROM customer_spend)
ORDER BY cs.total DESC;
customer_idtotal
C-2048450
C-1186320
C-0775110
CTE vs subquery: both work, but a CTE reads top-to-bottom like a recipe and can be reused multiple times in the same query. For anything beyond one level of nesting, prefer CTEs.
Key points
  • A CTE (WITH name AS (...)) is a named, reusable intermediate result.
  • Subqueries embed one query inside another; CTEs make multi-step logic far more readable.
  • Build complex analytics in clear layers rather than deep nesting.

4Window functions — the analyst's edge

Window functions compute across a set of rows related to the current row without collapsing them — running totals, rankings, period-over-period change. This is what separates intermediate from advanced SQL.

SELECT
    region,
    month,
    revenue,
    SUM(revenue) OVER (PARTITION BY region ORDER BY month)        AS running_total,
    RANK()       OVER (PARTITION BY region ORDER BY revenue DESC) AS rev_rank,
    LAG(revenue) OVER (PARTITION BY region ORDER BY month)        AS prev_month
FROM monthly_sales;
regionmonthrevenuerunning_totalrev_rankprev_month
NorthJan120012003
NorthFeb1450265011200
NorthMar1300395021450
FunctionGives you
ROW_NUMBER()a unique sequential number
RANK() / DENSE_RANK()ranking (with/without gaps for ties)
LAG() / LEAD()previous / next row's value (for % change)
NTILE(4)split rows into quartiles
Why it matters: “top 3 products per category”, “month-over-month growth” and “customer deciles” are trivial with window functions and painful without them. Master PARTITION BY (the group) and ORDER BY (the order within it).
Key points
  • Window functions calculate across related rows without collapsing them (unlike GROUP BY).
  • PARTITION BY sets the group; ORDER BY sets the order within each window.
  • Use ROW_NUMBER/RANK for top-N, LAG/LEAD for period-over-period change, NTILE for buckets.

5CASE, conditional logic & functions

CASE adds if/then logic inside SQL — perfect for bucketing and labelling.

SELECT
    order_id,
    amount,
    CASE
        WHEN amount >= 2000 THEN 'High'
        WHEN amount >= 1000 THEN 'Medium'
        ELSE 'Low'
    END AS value_band,
    UPPER(region)              AS region_code,
    EXTRACT(YEAR FROM order_date) AS order_year
FROM orders;
order_idamountvalue_bandregion_codeorder_year
10011200MediumNORTH2024
10032450HighWEST2024
1002980LowSOUTH2024
Handy functions: string — UPPER, LOWER, TRIM, CONCAT, SUBSTRING; dates — EXTRACT, DATE_TRUNC, CURRENT_DATE; nulls — COALESCE(col, 0) to replace missing values. CASE combined with SUM() also gives you pivot-style conditional totals.
Key points
  • CASE WHEN ... THEN ... ELSE ... END adds if/then logic and bucketing in SQL.
  • String/date functions clean and reshape values directly in the query.
  • COALESCE handles NULLs; CASE + SUM gives conditional/pivot-style totals.

6Query optimisation & SQL from Python

Slow queries cost time and money. A few habits keep them fast.

  • Select only the columns you need — avoid SELECT * in production.
  • Filter early with WHERE so less data flows through the query.
  • Index the columns you filter and join on most.
  • Read the plan with EXPLAIN to see how the database will run your query.
EXPLAIN
SELECT region, SUM(amount)
FROM   sales
WHERE  order_date >= '2024-01-01'
GROUP BY region;

Run SQL from your Python workflow

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('sqlite:///retail.db')

query = '''
    SELECT region, SUM(amount) AS total
    FROM   sales
    GROUP BY region
    ORDER BY total DESC
'''
df = pd.read_sql(query, engine)   # SQL result -> DataFrame, ready to chart
print(df.head())
▶ Output
  region   total
0  North  5300.0
1   East  4100.0
2   West  3800.0
Best of both worlds: do the joining, filtering and aggregating in SQL (fast, on the server), then hand the small, clean result to Pandas for visualisation and reporting.
Key points
  • Speed habits: select only needed columns, filter early, index join/filter keys, read the EXPLAIN plan.
  • pd.read_sql(query, engine) brings query results straight into Pandas.
  • Aggregate in SQL, then visualise in Python — fast where it counts, flexible where it helps.

★ Hands-on Project — Retail Database Deep Dive

Work through a realistic retail database and answer real business questions with SQL, finishing with a Python-powered report built on your query output.

  1. Set up a SQLite database (or use the provided one) with sales, customers and products tables.
  2. Warm up: top 10 products by revenue using GROUP BY + ORDER BY + LIMIT.
  3. Join orders to customers and products to report revenue by region and category.
  4. Use a CTE to find customers who spend above the overall average.
  5. Use window functions for 'top 3 products per category' (RANK() OVER (PARTITION BY ...)) and a monthly running total.
  6. Add a CASE statement to bucket orders into Low/Medium/High value bands and count each.
  7. Run your best query from Python with pd.read_sql and build one chart from the result.
  8. Save the queries (.sql) and the notebook to GitHub with short notes on each insight.

Ready to test yourself?

Take the module quiz. Score 70% or more to mark this module complete.

Start the quiz →

💡 Log in to save your progress and earn the certificate.