🗄️ Module 7

SQL for Analysts

⏱ 12 hoursIntermediate6 topics

🎯 By the end: write SELECT queries with filtering, grouping and sorting; join multiple tables; structure complex logic with subqueries and CTEs; rank and run totals with window functions; and integrate SQL into a Python analytics workflow.

SQL is the most-requested skill in data-analyst job ads worldwide — because the data lives in databases, and SQL is how you ask it questions. You already pulled query results into Pandas in Module 2; now you learn to write the queries like a professional.

1The core query: SELECT, WHERE, GROUP BY, HAVING, ORDER BY

Every analytics query is built from the same clauses, and they always run in a fixed logical order.

SELECT   region,
         SUM(amount) AS total_sales,
         COUNT(*)    AS orders
FROM     sales
WHERE    order_date >= '2024-01-01'      -- filter rows first
GROUP BY region                          -- collapse into groups
HAVING   SUM(amount) > 3000              -- filter the groups
ORDER BY total_sales DESC;               -- sort the result

region	total_sales	orders
North	5300	4
East	4100	3
West	3800	3

WHERE vs HAVING: WHERE filters individual rows before grouping; HAVING filters the groups after aggregation. Use WHERE for raw columns, HAVING for aggregates like SUM().

Key points

SELECT chooses columns, WHERE filters rows, GROUP BY aggregates, HAVING filters groups, ORDER BY sorts.
Use WHERE for raw values and HAVING for aggregate conditions (e.g. SUM() > 3000).
Alias columns with AS for readable, report-ready output.

2Joining tables

Analytics needs data from many tables. A JOIN combines rows from two tables on a shared key.

SELECT   o.order_id,
         c.name,
         c.city,
         o.amount
FROM     orders   AS o
LEFT JOIN customers AS c
       ON o.customer_id = c.customer_id
ORDER BY o.amount DESC;

order_id	name	city	amount
1003	Maya	Pune	1450
1001	Asha	Mumbai	1200
1002	Rohan	Delhi	980

Join	Keeps
`INNER JOIN`	only rows that match in both tables
`LEFT JOIN`	all rows from the left table + matches
`RIGHT JOIN`	all rows from the right table + matches
`FULL OUTER JOIN`	all rows from both tables
`CROSS JOIN`	every combination (use with care)

Alias your tables (orders AS o) and qualify columns (o.amount) so the query stays unambiguous and readable when joining several tables.

Key points

JOIN combines tables on a shared key; choose the type by which unmatched rows you want to keep.
INNER = matches only; LEFT = everything from the left table plus matches.
Alias tables and qualify column names to keep multi-table queries clear.

3Subqueries & Common Table Expressions (CTEs)

For multi-step logic, build the query in readable layers. A CTE (the WITH clause) is a named, temporary result you can reference — far cleaner than nested subqueries.

-- Which customers spend above the overall average?
WITH customer_spend AS (
    SELECT customer_id, SUM(amount) AS total
    FROM   orders
    GROUP BY customer_id
)
SELECT cs.customer_id, cs.total
FROM   customer_spend AS cs
WHERE  cs.total > (SELECT AVG(total) FROM customer_spend)
ORDER BY cs.total DESC;

customer_id	total
C-204	8450
C-118	6320
C-077	5110

CTE vs subquery: both work, but a CTE reads top-to-bottom like a recipe and can be reused multiple times in the same query. For anything beyond one level of nesting, prefer CTEs.

Key points

A CTE (WITH name AS (...)) is a named, reusable intermediate result.
Subqueries embed one query inside another; CTEs make multi-step logic far more readable.
Build complex analytics in clear layers rather than deep nesting.

4Window functions — the analyst's edge

Window functions compute across a set of rows related to the current row without collapsing them — running totals, rankings, period-over-period change. This is what separates intermediate from advanced SQL.

SELECT
    region,
    month,
    revenue,
    SUM(revenue) OVER (PARTITION BY region ORDER BY month)        AS running_total,
    RANK()       OVER (PARTITION BY region ORDER BY revenue DESC) AS rev_rank,
    LAG(revenue) OVER (PARTITION BY region ORDER BY month)        AS prev_month
FROM monthly_sales;

region	month	revenue	running_total	rev_rank	prev_month
North	Jan	1200	1200	3	—
North	Feb	1450	2650	1	1200
North	Mar	1300	3950	2	1450

Function	Gives you
`ROW_NUMBER()`	a unique sequential number
`RANK() / DENSE_RANK()`	ranking (with/without gaps for ties)
`LAG() / LEAD()`	previous / next row's value (for % change)
`NTILE(4)`	split rows into quartiles

Why it matters: “top 3 products per category”, “month-over-month growth” and “customer deciles” are trivial with window functions and painful without them. Master PARTITION BY (the group) and ORDER BY (the order within it).

Key points

Window functions calculate across related rows without collapsing them (unlike GROUP BY).
PARTITION BY sets the group; ORDER BY sets the order within each window.
Use ROW_NUMBER/RANK for top-N, LAG/LEAD for period-over-period change, NTILE for buckets.

5CASE, conditional logic & functions

CASE adds if/then logic inside SQL — perfect for bucketing and labelling.

SELECT
    order_id,
    amount,
    CASE
        WHEN amount >= 2000 THEN 'High'
        WHEN amount >= 1000 THEN 'Medium'
        ELSE 'Low'
    END AS value_band,
    UPPER(region)              AS region_code,
    EXTRACT(YEAR FROM order_date) AS order_year
FROM orders;

order_id	amount	value_band	region_code	order_year
1001	1200	Medium	NORTH	2024
1003	2450	High	WEST	2024
1002	980	Low	SOUTH	2024

Handy functions: string — UPPER, LOWER, TRIM, CONCAT, SUBSTRING; dates — EXTRACT, DATE_TRUNC, CURRENT_DATE; nulls — COALESCE(col, 0) to replace missing values. CASE combined with SUM() also gives you pivot-style conditional totals.

Key points

CASE WHEN ... THEN ... ELSE ... END adds if/then logic and bucketing in SQL.
String/date functions clean and reshape values directly in the query.
COALESCE handles NULLs; CASE + SUM gives conditional/pivot-style totals.

6Query optimisation & SQL from Python

Slow queries cost time and money. A few habits keep them fast.

Select only the columns you need — avoid SELECT * in production.
Filter early with WHERE so less data flows through the query.
Index the columns you filter and join on most.
Read the plan with EXPLAIN to see how the database will run your query.

EXPLAIN
SELECT region, SUM(amount)
FROM   sales
WHERE  order_date >= '2024-01-01'
GROUP BY region;

Run SQL from your Python workflow

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('sqlite:///retail.db')

query = '''
    SELECT region, SUM(amount) AS total
    FROM   sales
    GROUP BY region
    ORDER BY total DESC
'''
df = pd.read_sql(query, engine)   # SQL result -> DataFrame, ready to chart
print(df.head())

▶ Output

  region   total
0  North  5300.0
1   East  4100.0
2   West  3800.0

Best of both worlds: do the joining, filtering and aggregating in SQL (fast, on the server), then hand the small, clean result to Pandas for visualisation and reporting.

Key points

Speed habits: select only needed columns, filter early, index join/filter keys, read the EXPLAIN plan.
pd.read_sql(query, engine) brings query results straight into Pandas.
Aggregate in SQL, then visualise in Python — fast where it counts, flexible where it helps.

★ Hands-on Project — Retail Database Deep Dive

Work through a realistic retail database and answer real business questions with SQL, finishing with a Python-powered report built on your query output.

Set up a SQLite database (or use the provided one) with sales, customers and products tables.
Warm up: top 10 products by revenue using GROUP BY + ORDER BY + LIMIT.
Join orders to customers and products to report revenue by region and category.
Use a CTE to find customers who spend above the overall average.
Use window functions for 'top 3 products per category' (RANK() OVER (PARTITION BY ...)) and a monthly running total.
Add a CASE statement to bucket orders into Low/Medium/High value bands and count each.
Run your best query from Python with pd.read_sql and build one chart from the result.
Save the queries (.sql) and the notebook to GitHub with short notes on each insight.

Ready to test yourself?

Take the module quiz. Score 70% or more to mark this module complete.

Start the quiz →

💡 Log in to save your progress and earn the certificate.

← Previous

Statistics for Data Analytics

Business Intelligence & Reporting