SQL is the most-requested skill in data-analyst job ads worldwide — because the data lives in databases, and SQL is how you ask it questions. You already pulled query results into Pandas in Module 2; now you learn to write the queries like a professional.
1The core query: SELECT, WHERE, GROUP BY, HAVING, ORDER BY
Every analytics query is built from the same clauses, and they always run in a fixed logical order.
SELECT region,
SUM(amount) AS total_sales,
COUNT(*) AS orders
FROM sales
WHERE order_date >= '2024-01-01' -- filter rows first
GROUP BY region -- collapse into groups
HAVING SUM(amount) > 3000 -- filter the groups
ORDER BY total_sales DESC; -- sort the result| region | total_sales | orders |
|---|---|---|
| North | 5300 | 4 |
| East | 4100 | 3 |
| West | 3800 | 3 |
WHERE filters individual rows before grouping; HAVING filters the groups after aggregation. Use WHERE for raw columns, HAVING for aggregates like SUM().- SELECT chooses columns, WHERE filters rows, GROUP BY aggregates, HAVING filters groups, ORDER BY sorts.
- Use WHERE for raw values and HAVING for aggregate conditions (e.g.
SUM() > 3000). - Alias columns with
ASfor readable, report-ready output.
2Joining tables
Analytics needs data from many tables. A JOIN combines rows from two tables on a shared key.
SELECT o.order_id,
c.name,
c.city,
o.amount
FROM orders AS o
LEFT JOIN customers AS c
ON o.customer_id = c.customer_id
ORDER BY o.amount DESC;| order_id | name | city | amount |
|---|---|---|---|
| 1003 | Maya | Pune | 1450 |
| 1001 | Asha | Mumbai | 1200 |
| 1002 | Rohan | Delhi | 980 |
| Join | Keeps |
|---|---|
INNER JOIN | only rows that match in both tables |
LEFT JOIN | all rows from the left table + matches |
RIGHT JOIN | all rows from the right table + matches |
FULL OUTER JOIN | all rows from both tables |
CROSS JOIN | every combination (use with care) |
orders AS o) and qualify columns (o.amount) so the query stays unambiguous and readable when joining several tables.- JOIN combines tables on a shared key; choose the type by which unmatched rows you want to keep.
- INNER = matches only; LEFT = everything from the left table plus matches.
- Alias tables and qualify column names to keep multi-table queries clear.
3Subqueries & Common Table Expressions (CTEs)
For multi-step logic, build the query in readable layers. A CTE (the WITH clause) is a named, temporary result you can reference — far cleaner than nested subqueries.
-- Which customers spend above the overall average?
WITH customer_spend AS (
SELECT customer_id, SUM(amount) AS total
FROM orders
GROUP BY customer_id
)
SELECT cs.customer_id, cs.total
FROM customer_spend AS cs
WHERE cs.total > (SELECT AVG(total) FROM customer_spend)
ORDER BY cs.total DESC;| customer_id | total |
|---|---|
| C-204 | 8450 |
| C-118 | 6320 |
| C-077 | 5110 |
- A CTE (
WITH name AS (...)) is a named, reusable intermediate result. - Subqueries embed one query inside another; CTEs make multi-step logic far more readable.
- Build complex analytics in clear layers rather than deep nesting.
4Window functions — the analyst's edge
Window functions compute across a set of rows related to the current row without collapsing them — running totals, rankings, period-over-period change. This is what separates intermediate from advanced SQL.
SELECT
region,
month,
revenue,
SUM(revenue) OVER (PARTITION BY region ORDER BY month) AS running_total,
RANK() OVER (PARTITION BY region ORDER BY revenue DESC) AS rev_rank,
LAG(revenue) OVER (PARTITION BY region ORDER BY month) AS prev_month
FROM monthly_sales;| region | month | revenue | running_total | rev_rank | prev_month |
|---|---|---|---|---|---|
| North | Jan | 1200 | 1200 | 3 | — |
| North | Feb | 1450 | 2650 | 1 | 1200 |
| North | Mar | 1300 | 3950 | 2 | 1450 |
| Function | Gives you |
|---|---|
ROW_NUMBER() | a unique sequential number |
RANK() / DENSE_RANK() | ranking (with/without gaps for ties) |
LAG() / LEAD() | previous / next row's value (for % change) |
NTILE(4) | split rows into quartiles |
PARTITION BY (the group) and ORDER BY (the order within it).- Window functions calculate across related rows without collapsing them (unlike GROUP BY).
PARTITION BYsets the group;ORDER BYsets the order within each window.- Use ROW_NUMBER/RANK for top-N, LAG/LEAD for period-over-period change, NTILE for buckets.
5CASE, conditional logic & functions
CASE adds if/then logic inside SQL — perfect for bucketing and labelling.
SELECT
order_id,
amount,
CASE
WHEN amount >= 2000 THEN 'High'
WHEN amount >= 1000 THEN 'Medium'
ELSE 'Low'
END AS value_band,
UPPER(region) AS region_code,
EXTRACT(YEAR FROM order_date) AS order_year
FROM orders;| order_id | amount | value_band | region_code | order_year |
|---|---|---|---|---|
| 1001 | 1200 | Medium | NORTH | 2024 |
| 1003 | 2450 | High | WEST | 2024 |
| 1002 | 980 | Low | SOUTH | 2024 |
UPPER, LOWER, TRIM, CONCAT, SUBSTRING; dates — EXTRACT, DATE_TRUNC, CURRENT_DATE; nulls — COALESCE(col, 0) to replace missing values. CASE combined with SUM() also gives you pivot-style conditional totals.CASE WHEN ... THEN ... ELSE ... ENDadds if/then logic and bucketing in SQL.- String/date functions clean and reshape values directly in the query.
COALESCEhandles NULLs; CASE + SUM gives conditional/pivot-style totals.
6Query optimisation & SQL from Python
Slow queries cost time and money. A few habits keep them fast.
- Select only the columns you need — avoid
SELECT *in production. - Filter early with
WHEREso less data flows through the query. - Index the columns you filter and join on most.
- Read the plan with
EXPLAINto see how the database will run your query.
EXPLAIN
SELECT region, SUM(amount)
FROM sales
WHERE order_date >= '2024-01-01'
GROUP BY region;Run SQL from your Python workflow
import pandas as pd
from sqlalchemy import create_engine
engine = create_engine('sqlite:///retail.db')
query = '''
SELECT region, SUM(amount) AS total
FROM sales
GROUP BY region
ORDER BY total DESC
'''
df = pd.read_sql(query, engine) # SQL result -> DataFrame, ready to chart
print(df.head())region total 0 North 5300.0 1 East 4100.0 2 West 3800.0
- Speed habits: select only needed columns, filter early, index join/filter keys, read the EXPLAIN plan.
pd.read_sql(query, engine)brings query results straight into Pandas.- Aggregate in SQL, then visualise in Python — fast where it counts, flexible where it helps.
★ Hands-on Project — Retail Database Deep Dive
Work through a realistic retail database and answer real business questions with SQL, finishing with a Python-powered report built on your query output.
- Set up a SQLite database (or use the provided one) with
sales,customersandproductstables. - Warm up: top 10 products by revenue using
GROUP BY+ORDER BY+LIMIT. - Join orders to customers and products to report revenue by region and category.
- Use a CTE to find customers who spend above the overall average.
- Use window functions for 'top 3 products per category' (
RANK() OVER (PARTITION BY ...)) and a monthly running total. - Add a
CASEstatement to bucket orders into Low/Medium/High value bands and count each. - Run your best query from Python with
pd.read_sqland build one chart from the result. - Save the queries (.sql) and the notebook to GitHub with short notes on each insight.
Ready to test yourself?
Take the module quiz. Score 70% or more to mark this module complete.
Start the quiz →💡 Log in to save your progress and earn the certificate.