So far you have described what happened. Predictive analytics asks what is likely to happen next — and helps the business act early. This module is the bridge from analytics to machine learning, taught for analysts: light on equations, heavy on interpretation and business value.
1. What is predictive modelling?
A predictive model learns a pattern from past data (where you know the answer) and applies it to new data (where you do not).
| Type | Predicts | Business example |
|---|---|---|
| Regression | a number | next quarter's revenue |
| Classification | a category | will this customer churn? (yes/no) |
- A model learns a pattern from labelled history and applies it to new data.
- Regression predicts a number; classification predicts a category.
- Always split train vs test — judge a model only on data it has never seen.
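The train/test rule can be sketched as follows. The data here is synthetic and made up purely for illustration (the names ad_spend and sales and all numbers are assumptions), so treat it as a minimal sketch rather than a recipe.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical data: 200 weeks of ad spend and sales (invented for illustration)
rng = np.random.default_rng(0)
ad_spend = rng.uniform(10_000, 60_000, size=200)
sales = 9_800 + 4.8 * ad_spend + rng.normal(0, 15_000, size=200)

X = ad_spend.reshape(-1, 1)  # scikit-learn expects a 2-D feature array
y = sales

# Hold back 20% of the rows; the model never sees them while fitting
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)

train_r2 = model.score(X_train, y_train)  # scored on rows the model has seen
test_r2 = model.score(X_test, y_test)     # scored on unseen rows: the honest number
print('Train R-squared:', round(train_r2, 3))
print('Test R-squared :', round(test_r2, 3))
```

If the test score is much lower than the training score, the model has probably memorised noise in the training rows rather than learned a pattern that generalises.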
2. Linear regression for forecasting
Linear regression fits the best straight line through your data — ideal for forecasting a number from one or more drivers.
import pandas as pd
from sklearn.linear_model import LinearRegression
X = df[['ad_spend']] # feature(s): note the double brackets (2-D)
y = df['sales'] # target
model = LinearRegression().fit(X, y)
print('Slope :', round(model.coef_[0], 3))
print('Intercept:', round(model.intercept_, 1))
print('R-squared:', round(model.score(X, y), 3)) # in-sample: scored on the training rows
# forecast sales for a planned ad spend of 50,000
# (a DataFrame with the training column name avoids a feature-name warning)
forecast = model.predict(pd.DataFrame({'ad_spend': [50000]}))
print('Forecast :', round(forecast[0], 0))

Output:
Slope : 4.812
Intercept: 9800.0
R-squared: 0.871
Forecast : 250400.0
- Linear regression fits the best line to forecast a number from one or more features.
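- The slope (coefficient) is the expected change in the target for a one-unit increase in a feature, holding the other features fixed.
- R-squared measures the share of variation in the target that the model explains. On the training data it falls between 0 and 1; on unseen data it can be lower, and even negative for a poor model.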
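To make the slope and intercept concrete, the forecast in the printed output can be reproduced by hand. This is a small arithmetic sketch using the numbers shown above; the 10,000 increase is a hypothetical scenario added for illustration.

```python
# Numbers copied from the printed model output
slope = 4.812        # extra sales per 1 unit of extra ad spend
intercept = 9800.0   # predicted sales when ad spend is zero
ad_spend = 50_000    # planned ad spend

# A linear model's prediction is intercept + slope * feature
forecast = intercept + slope * ad_spend
print('Forecast:', round(forecast))

# Hypothetical scenario: what does 10,000 of extra ad spend buy?
uplift = slope * 10_000
print('Uplift from +10,000 ad spend:', round(uplift))
```

Reading the slope this way ("each extra unit of ad spend is associated with about 4.8 units of sales") is usually the sentence the business wants, with the caveat that it describes an association in the historical data, not a guaranteed effect.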
3. Classification: predicting categories
When the answer is a category (churn / not churn, fraud / legit), use classification. Logistic regression and decision trees are the analyst's go-to starters.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
X = df[['tenure', 'monthly_charges', 'support_calls']]
y = df['churned'] # 1 = left, 0 = stayed
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print('Test accuracy:', round(model.score(X_test, y_test), 3))
# probability each test customer will churn
proba = model.predict_proba(X_test)[:, 1]
print('First 3 churn probabilities:', proba[:3].round(2))

Output:
Test accuracy: 0.842
First 3 churn probabilities: [0.08 0.71 0.23]
- Classification predicts a category; logistic regression and decision trees are great starters.
- predict_proba gives the probability of each class, which is often more useful than a hard yes/no.
- Decision trees produce human-readable rules, ideal for explaining decisions to the business.
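One reason probabilities are more useful than a hard yes/no is that the business can choose its own cut-off and rank customers by risk. The sketch below reuses the three probabilities printed above; the 0.2 threshold is a hypothetical business rule, not something the model prescribes.

```python
import numpy as np

# The three churn probabilities printed earlier, standing in for predict_proba output
proba = np.array([0.08, 0.71, 0.23])

# Conventional rule: flag a customer when the churn probability reaches 0.5
flag_default = proba >= 0.5

# Hypothetical business rule: retention calls are cheap, so flag from 0.2 upward
flag_cautious = proba >= 0.2

# Rank customers from highest to lowest risk to build a call list
call_order = np.argsort(-proba)

print('Flagged at 0.5:', flag_default.tolist())
print('Flagged at 0.2:', flag_cautious.tolist())
print('Call order    :', call_order.tolist())
```

Lowering the threshold catches more real churners at the cost of more false alarms, so the right value depends on what a missed churner costs compared with an unnecessary call.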
4. Evaluating models honestly
How good is the model? Use the right metric for the job — and never trust accuracy alone.
| For… | Use | Means |
|---|---|---|
| Regression | RMSE / MAE | average size of the prediction error |
| Regression | R² | share of variation explained |
| Classification | precision | of those predicted positive, how many were right |
| Classification | recall | of all real positives, how many we caught |
| Classification | F1 | balance of precision and recall |
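As a rough illustration of the regression rows in the table, the three metrics can be computed with scikit-learn. The actual and predicted values below are hypothetical numbers chosen to keep the arithmetic easy to follow.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical actual vs predicted sales for five test periods
y_true = np.array([100.0, 120.0, 90.0, 150.0, 130.0])
y_pred = np.array([110.0, 115.0, 95.0, 140.0, 130.0])

mae = mean_absolute_error(y_true, y_pred)           # mean of the absolute errors
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # root of the mean squared error
r2 = r2_score(y_true, y_pred)                       # share of variation explained

print('MAE :', round(mae, 2))
print('RMSE:', round(rmse, 2))
print('R2  :', round(r2, 3))
```

MAE and RMSE are in the same units as the target, which makes them easy to explain ("we are typically off by about 6 units"). RMSE penalises large misses more heavily, so it is never smaller than MAE.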
A confusion matrix shows exactly where a classifier is right and wrong: it counts the true positives, false positives, true negatives and false negatives.
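A minimal sketch of a confusion matrix and the metrics derived from it, using scikit-learn. The ten labels are hypothetical and exist only to make the counts easy to verify by eye.

```python
from sklearn.metrics import (confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical test labels: 1 = churned, 0 = stayed
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
print(cm)

precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two

print('Precision:', round(precision, 3))
print('Recall   :', round(recall, 3))
print('F1       :', round(f1, 3))
```

Here the classifier flags three customers and is right about two of them (precision 2/3), but it only catches two of the four real churners (recall 1/2).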
- Regression: judge with RMSE/MAE (error size) and R² (variation explained).
- Classification: precision (right when positive), recall (caught the positives), F1 (their balance).
- On imbalanced data, accuracy is misleading — read the confusion matrix and precision/recall.
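The imbalance warning is easiest to see with a deliberately useless model. In this hypothetical test set only 5 of 100 customers churn, and the "model" simply predicts that nobody ever leaves.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical imbalanced test set: 95 customers stayed (0), 5 churned (1)
y_true = np.array([0] * 95 + [1] * 5)

# A "model" that predicts that nobody ever churns
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)  # share of all predictions that are right
recall = recall_score(y_true, y_pred)      # share of real churners that were caught

print('Accuracy:', accuracy)
print('Recall  :', recall)
```

The accuracy looks excellent while the model catches none of the customers the business actually cares about, which is why recall and the confusion matrix need to be read alongside it.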
5. Feature importance & business translation
A model's real value is telling you which factors matter most — so the business can act on them.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier(max_depth=4).fit(X_train, y_train)
importance = pd.Series(tree.feature_importances_, index=X.columns)
print(importance.sort_values(ascending=False).round(3))

Output:
support_calls      0.512
tenure             0.318
monthly_charges    0.170
dtype: float64
Turn the model into a recommendation
| Model output | Business translation |
|---|---|
| support_calls is the top driver of churn | Customers who contact support a lot are flight risks — trigger a proactive call after the 3rd ticket. |
| short tenure increases churn risk | Strengthen onboarding in the first 90 days. |
Calling .fit() is the easy part. You are paid to translate “feature importance 0.51” into “here is what we should do, and the expected impact.” Always close the loop from model → recommendation.
- Coefficients (linear/logistic) and feature_importances_ (trees) reveal what drives the outcome.
- Rank the drivers, then translate each into a concrete business action.
- The model's job is to inform a decision — always end with a recommendation.
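Feature importances say which drivers matter but not where the cut-offs lie. One way to get closer to an actionable rule is to print the fitted tree as text with scikit-learn's export_text. The sketch below uses synthetic customers in which churn is driven entirely by support calls; that relationship is an assumption built into the made-up data, not a finding.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical customers (all values invented for illustration)
rng = np.random.default_rng(0)
X = pd.DataFrame({
    'tenure': rng.integers(1, 60, size=300),
    'monthly_charges': rng.uniform(20, 120, size=300).round(2),
    'support_calls': rng.integers(0, 8, size=300),
})
# Assumption baked into the fake data: 4 or more support calls means churn
y = (X['support_calls'] >= 4).astype(int)  # 1 = churned, 0 = stayed

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Print the fitted tree as nested if/else rules with readable feature names
rules = export_text(tree, feature_names=list(X.columns))
print(rules)

importance = pd.Series(tree.feature_importances_, index=X.columns)
print(importance.sort_values(ascending=False).round(3))
```

A printed threshold on support_calls is the kind of detail that turns "support calls matter" into "contact the customer once they pass this number of tickets".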
6. Time-series forecasting
For data over time (daily sales, monthly demand), specialised methods capture trend and seasonality.
Start simple: moving average & exponential smoothing
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing
ts = df.set_index('date')['sales']
moving_avg = ts.rolling(window=3).mean() # smooths short-term noise
model = ExponentialSmoothing(ts, trend='add', seasonal='add',
                             seasonal_periods=12).fit()
forecast = model.forecast(3) # next 3 months
print(forecast.round(0))

Output:
2024-07-31    46100.0
2024-08-31    47550.0
2024-09-30    49200.0
Freq: M, Name: sales, dtype: float64
Decomposition (seasonal_decompose) splits a series into trend, seasonality and residual. It is a great first look that explains why a series moves the way it does.
- Time-series methods capture trend and seasonality that plain regression misses.
- Moving averages smooth noise; exponential smoothing/Holt-Winters forecast forward.
- Decomposition explains a series as trend + seasonality + residual; always show forecast uncertainty.
★ Hands-on Project — Sales Forecasting Model
Build a regression-based sales forecast for a retail business, evaluate it honestly, and deliver a plain-English business report.
- Load historical sales with at least one driver (e.g. ad spend, month, promotions).
- Split the data into training and test sets with train_test_split.
- Fit a LinearRegression model and report the slope(s), intercept and R-squared.
- Evaluate on the test set with RMSE/MAE and interpret the typical error size in business terms.
- Forecast the next period and show a confidence range (not just a single number).
- Identify which feature matters most and translate it into one concrete recommendation.
- Bonus: build a time-series forecast (exponential smoothing) and plot history + forecast.
- Write a one-page report (forecast, accuracy, assumptions, recommendation) and push it to GitHub.
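For the confidence-range step, one rough option is to widen the point forecast by the typical test-set error. The sketch below uses plus or minus 1.96 times the test RMSE, which is only an approximation that assumes roughly normal, evenly spread errors; it is not a formal prediction interval. The data is synthetic and the variable names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical sales history (invented for illustration)
rng = np.random.default_rng(1)
ad_spend = rng.uniform(10_000, 60_000, size=200)
sales = 9_800 + 4.8 * ad_spend + rng.normal(0, 15_000, size=200)
X = ad_spend.reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, sales, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)

# Typical error size on unseen data
rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))

# Point forecast for a planned ad spend of 50,000
point = model.predict(np.array([[50_000]]))[0]

# Rough 95% range: point forecast plus or minus 1.96 x test RMSE (an approximation)
low, high = point - 1.96 * rmse, point + 1.96 * rmse
print(f'Forecast: {point:,.0f} (rough range {low:,.0f} to {high:,.0f})')
```

Reporting the range alongside the single number tells the reader how much to trust the forecast and is easy to explain in a one-page report.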