🔮 Module 9

Predictive Analytics Fundamentals

⏱ 10 hours · Advanced · 6 topics
🎯 By the end: build and interpret a regression forecast, train a simple classifier, evaluate models honestly with the right metrics, read feature importance, and forecast a time series — all framed for business decisions, not maths exams.

So far you have described what happened. Predictive analytics asks what is likely to happen next — and helps the business act early. This module is the bridge from analytics to machine learning, taught for analysts: light on equations, heavy on interpretation and business value.

Scope note: this is the fundamentals for analysts. Deep machine learning — complex models, tuning, neural networks — belongs to a dedicated Data Science course. Here we build just enough to forecast and classify confidently.

1. What is predictive modelling?

A predictive model learns a pattern from past data (where you know the answer) and applies it to new data (where you do not).

Features (X): ad spend, tenure, season… → Model: learns from history → Prediction (ŷ): next month's sales…
Inputs (features) flow through a trained model to produce a prediction.
Type | Predicts | Business example
Regression | a number | next quarter's revenue
Classification | a category | will this customer churn? (yes/no)
Never test on data the model has seen. Always split into a training set (to learn) and a test set (to judge). Scoring on training data flatters the model and hides how it will really perform.
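
To make the warning concrete, here is a minimal sketch on synthetic data. The data, the noise level and the choice of an unrestricted decision tree are illustrative assumptions, not the module's dataset. The tree can memorise its training rows, so the training score looks far better than the score on rows it has never seen.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, illustrative data: one weak signal buried in noise
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
noise = rng.normal(scale=2.0, size=400)
y = (X[:, 0] + noise > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# With no depth limit, the tree can memorise every training row
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = model.score(X_train, y_train)   # flattering: data the model has seen
test_acc = model.score(X_test, y_test)      # honest: data the model has not seen
print('Training accuracy:', round(train_acc, 3))
print('Test accuracy    :', round(test_acc, 3))
```

The gap between the two numbers is the part of the training score that was memorisation rather than learning.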
Key points
  • A model learns a pattern from labelled history and applies it to new data.
  • Regression predicts a number; classification predicts a category.
  • Always split train vs test — judge a model only on data it has never seen.

2. Linear regression for forecasting

Linear regression fits the best straight line through your data — ideal for forecasting a number from one or more drivers.

import pandas as pd
from sklearn.linear_model import LinearRegression

X = df[['ad_spend']]      # feature(s): note the double brackets (2-D)
y = df['sales']           # target

model = LinearRegression().fit(X, y)
print('Slope    :', round(model.coef_[0], 3))
print('Intercept:', round(model.intercept_, 1))
# R-squared on the same data used to fit (an optimistic, in-sample figure)
print('R-squared:', round(model.score(X, y), 3))

# forecast sales for a planned ad spend of 50,000
# (pass a DataFrame with the same column name the model was trained on)
new_spend = pd.DataFrame({'ad_spend': [50000]})
forecast = model.predict(new_spend)
print('Forecast :', round(forecast[0], 0))
▶ Output
Slope    : 4.812
Intercept: 9800.0
R-squared: 0.871
Forecast : 250400.0
[Figure: Sales vs ad spend, with ad spend on the x-axis]
The fitted line (orange) captures the upward trend; new spend → predicted sales.
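Read it in business terms: a slope of 4.81 means each extra ₹1 of ad spend is associated with about ₹4.81 more sales. R-squared = 0.87 means the model explains 87% of the variation in sales, which is a strong fit. (R² near 0 = weak, near 1 = strong.) This R² was computed on the same data the model was fitted to, so confirm it on a held-out test set before relying on it.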
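
The forecast is nothing more than the line's equation applied to a new input. As a quick check, the sketch below plugs the slope and intercept printed above into intercept + slope × spend. The 10,000 budget increase is a hypothetical what-if added for illustration.

```python
# Values copied from the output above
slope = 4.812
intercept = 9800.0

# Forecast = intercept + slope * planned ad spend
planned_spend = 50_000
forecast = intercept + slope * planned_spend
print('Forecast   :', round(forecast))      # 250400

# Hypothetical what-if: raise the budget by 10,000
extra_spend = 10_000
extra_sales = slope * extra_spend
print('Extra sales:', round(extra_sales))   # 48120
```

Remember that the slope describes an association in past data. It supports a what-if estimate, not a guarantee that extra spend will cause that uplift.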
Key points
  • Linear regression fits the best line to forecast a number from one or more features.
  • The slope (coefficient) is the change in the target associated with a one-unit increase in a feature.
  • R-squared (0→1) measures how much of the variation the model explains.

3. Classification: predicting categories

When the answer is a category (churn / not churn, fraud / legit), use classification. Logistic regression and decision trees are the analyst's go-to starters.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X = df[['tenure', 'monthly_charges', 'support_calls']]
y = df['churned']        # 1 = left, 0 = stayed

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print('Test accuracy:', round(model.score(X_test, y_test), 3))

# probability each test customer will churn
proba = model.predict_proba(X_test)[:, 1]
print('First 3 churn probabilities:', proba[:3].round(2))
▶ Output
Test accuracy: 0.842
First 3 churn probabilities: [0.08 0.71 0.23]
Decision trees are wonderfully explainable. A tree reads like a flowchart of yes/no rules (“if support_calls > 4 and tenure < 6 months → likely to churn”). For stakeholders who need to understand the model, trees beat black boxes.
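
To show what those flowchart-style rules look like in practice, here is a small sketch using scikit-learn's export_text. The customer data is synthetic and the labels are generated from a hypothetical rule (many support calls plus short tenure), so the printed thresholds are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic, illustrative customers (not the module's dataset)
rng = np.random.default_rng(0)
customers = pd.DataFrame({
    'tenure': rng.integers(1, 37, size=500),          # months with the company
    'support_calls': rng.integers(0, 9, size=500),    # calls in the period
})

# Hypothetical rule used only to generate labels for this demo
churned = ((customers['support_calls'] > 4)
           & (customers['tenure'] < 6)).astype(int)

# A shallow tree keeps the rule set short enough to read aloud
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(customers, churned)

# export_text renders the fitted tree as indented if/else rules
rules = export_text(tree, feature_names=list(customers.columns))
print(rules)
```

Keeping max_depth small trades a little accuracy for rules a stakeholder can follow in a meeting.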
Key points
  • Classification predicts a category; logistic regression and decision trees are great starters.
  • predict_proba gives the probability of each class — often more useful than a hard yes/no.
  • Decision trees produce human-readable rules, ideal for explaining decisions to the business.

4. Evaluating models honestly

How good is the model? Use the right metric for the job — and never trust accuracy alone.

For… | Use | Means
Regression | RMSE / MAE | average size of the prediction error
Regression | R² | share of variation explained
Classification | precision | of those predicted positive, how many were right
Classification | recall | of all real positives, how many we caught
Classification | F1 | balance of precision and recall
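
The regression metrics in the table are simple arithmetic on the prediction errors. The sketch below computes them by hand on four hypothetical months of sales (the numbers are made up for illustration) so you can see what each one measures.

```python
import math

# Hypothetical actual vs predicted monthly sales (in thousands)
actual = [100, 120, 130, 150]
predicted = [110, 115, 140, 140]

errors = [a - p for a, p in zip(actual, predicted)]      # [-10, 5, -10, 10]

# MAE: average absolute error
mae = sum(abs(e) for e in errors) / len(errors)

# RMSE: square the errors, average, take the root (penalises big misses more)
rmse = math.sqrt(sum(e ** 2 for e in errors) / len(errors))

# R-squared: 1 minus (unexplained variation / total variation)
mean_actual = sum(actual) / len(actual)
ss_res = sum(e ** 2 for e in errors)
ss_tot = sum((a - mean_actual) ** 2 for a in actual)
r2 = 1 - ss_res / ss_tot

print('MAE :', round(mae, 2))    # 8.75
print('RMSE:', round(rmse, 2))   # 9.01
print('R2  :', round(r2, 2))     # 0.75
```

MAE and RMSE are in the same units as the target, which makes them easy to explain: here the forecast is typically off by about 9 thousand.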

A confusion matrix shows exactly where a classifier is right and wrong:

Actual \ Predicted | Negative | Positive
Negative | TN (correct) | FP (false alarm)
Positive | FN (missed) | TP (correct)
TN and TP are correct predictions; FP and FN are the two kinds of mistake (false alarms and misses).
Accuracy can lie. If only 2% of customers churn, a model that predicts “nobody churns” is 98% accurate — and useless. On imbalanced problems, precision, recall and F1 tell the truth.
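
The 2% churn example can be worked through directly from confusion-matrix counts. In this sketch, Model A is the "nobody churns" model from the paragraph above, and Model B uses hypothetical counts chosen for illustration. The helper name scores_from_counts is also made up for this example, and returning 0 when a ratio is undefined is a convention assumed here.

```python
def scores_from_counts(tp, fp, fn, tn):
    """Return (accuracy, precision, recall, f1) from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Convention assumed here: report 0 when a ratio is undefined
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, precision, recall, f1

# 1,000 customers, 20 of whom really churn (2%)
model_a = scores_from_counts(tp=0, fp=0, fn=20, tn=980)     # predicts "nobody churns"
model_b = scores_from_counts(tp=12, fp=18, fn=8, tn=962)    # hypothetical counts

for name, (acc, prec, rec, f1) in [('A', model_a), ('B', model_b)]:
    print(f'Model {name}: accuracy={acc:.3f} precision={prec:.2f} '
          f'recall={rec:.2f} F1={f1:.2f}')
# Model A: accuracy=0.980 precision=0.00 recall=0.00 F1=0.00
# Model B: accuracy=0.974 precision=0.40 recall=0.60 F1=0.48
```

Model A wins on accuracy yet catches no churners at all. Model B is slightly less accurate but finds 12 of the 20, which is the number the retention team actually cares about.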
Key points
  • Regression: judge with RMSE/MAE (error size) and R² (variation explained).
  • Classification: precision (right when positive), recall (caught the positives), F1 (their balance).
  • On imbalanced data, accuracy is misleading — read the confusion matrix and precision/recall.

5. Feature importance & business translation

A model's real value is telling you which factors matter most — so the business can act on them.

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(max_depth=4).fit(X_train, y_train)

importance = pd.Series(tree.feature_importances_, index=X.columns)
print(importance.sort_values(ascending=False).round(3))
▶ Output
support_calls     0.512
tenure            0.318
monthly_charges   0.170
dtype: float64

Turn the model into a recommendation

Model output | Business translation
support_calls is the top driver of churn | Customers who contact support a lot are flight risks; trigger a proactive call after the 3rd ticket.
short tenure increases churn risk | Strengthen onboarding in the first 90 days.
The analyst's value-add: anyone can run .fit(). You are paid to translate “feature importance 0.51” into “here is what we should do, and the expected impact.” Always close the loop from model → recommendation.
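
Tree importances rank the drivers but do not say which way each one pushes. One way to get the direction is to read the signs of logistic regression coefficients. The sketch below is an illustration on synthetic data: the data-generating rule (more calls raise churn risk, longer tenure lowers it) and the decision to standardise the features are assumptions made for this example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic, illustrative customers (not the module's dataset)
rng = np.random.default_rng(0)
n = 2000
data = pd.DataFrame({
    'tenure': rng.integers(1, 61, size=n),
    'support_calls': rng.integers(0, 9, size=n),
})

# Hypothetical ground truth: calls raise churn risk, tenure lowers it
log_odds = -1.0 + 0.6 * data['support_calls'] - 0.08 * data['tenure']
churned = (rng.random(n) < 1 / (1 + np.exp(-log_odds))).astype(int)

# Standardise so coefficient sizes are comparable across features
X_scaled = StandardScaler().fit_transform(data)
model = LogisticRegression(max_iter=1000).fit(X_scaled, churned)

# Positive sign = pushes towards churn, negative sign = pushes away from it
coefs = pd.Series(model.coef_[0], index=data.columns)
print(coefs.round(2))
```

A positive coefficient for support_calls and a negative one for tenure would back up the two business translations in the table: more contact means more risk, and longer-standing customers are safer.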
Key points
  • Coefficients (linear/logistic) and feature_importances_ (trees) reveal what drives the outcome.
  • Rank the drivers, then translate each into a concrete business action.
  • The model's job is to inform a decision — always end with a recommendation.

6. Time-series forecasting

For data over time (daily sales, monthly demand), specialised methods capture trend and seasonality.

Start simple: moving average & exponential smoothing

import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

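# assumes 'date' holds parsed, regularly spaced monthly dates;
# seasonal_periods=12 needs at least two full years of history
ts = df.set_index('date')['sales']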

moving_avg = ts.rolling(window=3).mean()      # smooths short-term noise

model = ExponentialSmoothing(ts, trend='add', seasonal='add',
                             seasonal_periods=12).fit()
forecast = model.forecast(3)                  # next 3 months
print(forecast.round(0))
▶ Output
2024-07-31    46100.0
2024-08-31    47550.0
2024-09-30    49200.0
Freq: M, Name: sales, dtype: float64
[Figure: Sales, history + forecast]
Solid teal = actuals; dashed orange = the model's forecast for the coming months.
Decomposition (seasonal_decompose) splits a series into trend, seasonality and residual — a great first look that explains why a series moves the way it does.
Simple, explainable, monitored. A modest forecast you understand and re-check each month beats a complex one nobody trusts. Always show a forecast with its uncertainty, and compare predictions to what actually happens.
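
One simple way to practise "compare predictions to what actually happens" is a backtest: hide the most recent months, forecast them, then measure the miss. The sketch below uses synthetic data and a seasonal-naive baseline (forecast = the same month one year earlier), which is a deliberately simple stand-in rather than the Holt-Winters model above. The ± band is a rough heuristic based on past error size, not a formal prediction interval.

```python
import numpy as np
import pandas as pd

# Synthetic monthly sales (illustrative): trend + yearly cycle + noise
rng = np.random.default_rng(1)
months = pd.date_range('2021-01-01', periods=36, freq='MS')
t = np.arange(36)
sales = pd.Series(
    40_000 + 300 * t + 4_000 * np.sin(2 * np.pi * t / 12)
    + rng.normal(0, 500, 36),
    index=months, name='sales')

# Hold out the last 3 months as the "future"
train, test = sales.iloc[:-3], sales.iloc[-3:]

# Seasonal-naive baseline: reuse the value from the same month a year earlier
forecast = pd.Series(train.iloc[-12:-9].to_numpy(), index=test.index)

# Rough band: 2 x the root-mean-square of this method's past errors
past_errors = (train - train.shift(12)).dropna()
band = 2 * np.sqrt((past_errors ** 2).mean())

report = pd.DataFrame({
    'actual': test,
    'forecast': forecast,
    'low': forecast - band,
    'high': forecast + band,
})
report['error'] = report['actual'] - report['forecast']
print(report.round(0))
print('MAE on held-out months:', round(report['error'].abs().mean()))
```

Any candidate model, including exponential smoothing, can be dropped into the same harness. If it cannot beat the seasonal-naive baseline on held-out months, the extra complexity is not earning its keep.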
Key points
  • Time-series methods capture trend and seasonality that plain regression misses.
  • Moving averages smooth noise; exponential smoothing/Holt-Winters forecast forward.
  • Decomposition explains a series as trend + seasonality + residual; always show forecast uncertainty.

★ Hands-on Project — Sales Forecasting Model

Build a regression-based sales forecast for a retail business, evaluate it honestly, and deliver a plain-English business report.

  1. Load historical sales with at least one driver (e.g. ad spend, month, promotions).
  2. Split the data into training and test sets with train_test_split.
  3. Fit a LinearRegression model and report the slope(s), intercept and R-squared.
  4. Evaluate on the test set with RMSE/MAE and interpret the typical error size in business terms.
  5. Forecast the next period and show a confidence range (not just a single number).
  6. Identify which feature matters most and translate it into one concrete recommendation.
  7. Bonus: build a time-series forecast (exponential smoothing) and plot history + forecast.
  8. Write a one-page report (forecast, accuracy, assumptions, recommendation) and push it to GitHub.
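
As a possible starting point for steps 2 to 5, here is a minimal skeleton. It runs on synthetic stand-in data (the ad_spend and sales values are generated, not real), and the "± 2 × RMSE" range is a rough rule of thumb assumed for illustration rather than a formal prediction interval. Swap in your own data and adapt as needed.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

# Step 1 (stand-in): synthetic retail data; replace with your own file
rng = np.random.default_rng(0)
df = pd.DataFrame({'ad_spend': rng.uniform(10_000, 60_000, size=120)})
df['sales'] = 10_000 + 4.8 * df['ad_spend'] + rng.normal(0, 15_000, size=120)

# Step 2: split into training and test sets
X, y = df[['ad_spend']], df['sales']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Step 3: fit on the training set only
model = LinearRegression().fit(X_train, y_train)
print('Slope    :', round(model.coef_[0], 3))
print('Intercept:', round(model.intercept_, 1))

# Step 4: evaluate on the unseen test set
pred = model.predict(X_test)
mae = mean_absolute_error(y_test, pred)
rmse = np.sqrt(mean_squared_error(y_test, pred))
print('Test R2  :', round(model.score(X_test, y_test), 3))
print('Test MAE :', round(mae))
print('Test RMSE:', round(rmse))

# Step 5: forecast with a rough range instead of a single number
planned = pd.DataFrame({'ad_spend': [50_000]})
point = model.predict(planned)[0]
print(f'Forecast : {point:,.0f} '
      f'(roughly {point - 2 * rmse:,.0f} to {point + 2 * rmse:,.0f})')
```

In the report, state the test MAE in plain terms ("our forecast is typically off by about X") and say clearly that the range is approximate.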

Ready to test yourself?

Take the module quiz. Score 70% or more to mark this module complete.
