🔮 Module 9

Predictive Analytics Fundamentals

⏱ 10 hours · Advanced · 6 topics

🎯 By the end: build and interpret a regression forecast, train a simple classifier, evaluate models honestly with the right metrics, read feature importance, and forecast a time series — all framed for business decisions, not maths exams.

So far you have described what happened. Predictive analytics asks what is likely to happen next — and helps the business act early. This module is the bridge from analytics to machine learning, taught for analysts: light on equations, heavy on interpretation and business value.

Scope note: this is the fundamentals for analysts. Deep machine learning — complex models, tuning, neural networks — belongs to a dedicated Data Science course. Here we build just enough to forecast and classify confidently.

1. What is predictive modelling?

A predictive model learns a pattern from past data (where you know the answer) and applies it to new data (where you do not).

Inputs (features) flow through a trained model to produce a prediction.

Type	Predicts	Business example
Regression	a number	next quarter's revenue
Classification	a category	will this customer churn? (yes/no)

Never test on data the model has seen. Always split into a training set (to learn) and a test set (to judge). Scoring on training data flatters the model and hides how it will really perform.
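To see why scoring on training data flatters a model, here is a minimal sketch on synthetic data. The data, the three unnamed features and the choice of an unpruned decision tree are all assumptions made purely for illustration. The tree memorises its training rows, so its training accuracy looks perfect while the test accuracy shows the realistic figure.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, illustrative data: 3 numeric features and a noisy yes/no target
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = (X[:, 0] + rng.normal(scale=1.0, size=400) > 0).astype(int)

# Hold back 20% of the rows; the model never sees them while learning
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# An unrestricted tree is able to memorise the training rows
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = model.score(X_train, y_train)   # flattering: data already seen
test_acc = model.score(X_test, y_test)      # honest: data never seen
print('Train accuracy:', round(train_acc, 3))
print('Test accuracy :', round(test_acc, 3))
```

The gap between the two printed numbers is the part of the score that came from memorising rather than learning.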

Key points

A model learns a pattern from labelled history and applies it to new data.
Regression predicts a number; classification predicts a category.
Always split train vs test — judge a model only on data it has never seen.

2. Linear regression for forecasting

Linear regression fits the best straight line through your data — ideal for forecasting a number from one or more drivers.

import pandas as pd
from sklearn.linear_model import LinearRegression

X = df[['ad_spend']]      # feature(s): note the double brackets (2-D)
y = df['sales']           # target

model = LinearRegression().fit(X, y)
print('Slope    :', round(model.coef_[0], 3))
print('Intercept:', round(model.intercept_, 1))
print('R-squared:', round(model.score(X, y), 3))

# forecast sales for a planned ad spend of 50,000
# (a DataFrame with the training column name avoids a feature-name warning)
new_spend = pd.DataFrame({'ad_spend': [50000]})
forecast = model.predict(new_spend)
print('Forecast :', round(forecast[0], 0))

▶ Output

Slope    : 4.812
Intercept: 9800.0
R-squared: 0.871
Forecast : 250400.0

The fitted line (orange) captures the upward trend; new spend → predicted sales.

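Read it in business terms: a slope of 4.81 means each extra ₹1 of ad spend is associated with about ₹4.81 more sales. R-squared = 0.87 means the model explains 87% of the variation in sales, a strong fit. (R² near 0 = weak, near 1 = strong.) This R² is scored on the same rows the model was fitted to, so treat it as optimistic; a held-out test set gives the honest figure.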
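The forecast itself is just the fitted line evaluated at the planned spend. A quick check of the arithmetic, using the rounded slope and intercept from the output above (an unrounded model could differ slightly):

```python
# Values copied from the printed output (rounded)
slope = 4.812          # extra sales per extra 1 of ad spend
intercept = 9800.0     # predicted sales at zero ad spend
planned_spend = 50_000

# forecast = intercept + slope * spend
forecast = intercept + slope * planned_spend
print('Forecast:', round(forecast))   # 250400
```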

Key points

Linear regression fits the best line to forecast a number from one or more features.
The slope (coefficient) is the change in the target associated with a one-unit increase in a feature.
R-squared (usually between 0 and 1) measures how much of the variation the model explains.

3. Classification: predicting categories

When the answer is a category (churn / not churn, fraud / legit), use classification. Logistic regression and decision trees are the analyst's go-to starters.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X = df[['tenure', 'monthly_charges', 'support_calls']]
y = df['churned']        # 1 = left, 0 = stayed

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print('Test accuracy:', round(model.score(X_test, y_test), 3))

# probability each test customer will churn
proba = model.predict_proba(X_test)[:, 1]
print('First 3 churn probabilities:', proba[:3].round(2))

▶ Output

Test accuracy: 0.842
First 3 churn probabilities: [0.08 0.71 0.23]

Decision trees are wonderfully explainable. A tree reads like a flowchart of yes/no rules (“if support_calls > 4 and tenure < 6 months → likely to churn”). For stakeholders who need to understand the model, trees beat black boxes.
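To show what "reads like a flowchart" looks like in practice, here is a minimal sketch that fits a shallow tree and prints its rules with scikit-learn's export_text. The twelve customers are invented for illustration, with churn defined as more than 4 support calls and under 6 months of tenure.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Tiny made-up dataset: churn happens when calls are high AND tenure is short
df = pd.DataFrame({
    'support_calls': [0, 1, 2, 5, 6, 7, 5, 6, 1, 8, 6, 0],
    'tenure':        [24, 3, 12, 2, 4, 3, 30, 18, 2, 1, 40, 5],
})
df['churned'] = ((df['support_calls'] > 4) & (df['tenure'] < 6)).astype(int)

X = df[['support_calls', 'tenure']]
y = df['churned']

# A shallow tree keeps the rule set short enough to read aloud
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# export_text renders the fitted tree as nested if/else rules
rules = export_text(tree, feature_names=list(X.columns))
print(rules)
```

Each indented line in the printout is one yes/no question; following a branch to its end gives the predicted class.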

Key points

Classification predicts a category; logistic regression and decision trees are great starters.
predict_proba gives the probability of each class — often more useful than a hard yes/no.
Decision trees produce human-readable rules, ideal for explaining decisions to the business.

4. Evaluating models honestly

How good is the model? Use the right metric for the job — and never trust accuracy alone.

For…	Use	Means
Regression	`RMSE` / `MAE`	average size of the prediction error
Regression	`R²`	share of variation explained
Classification	`precision`	of those predicted positive, how many were right
Classification	`recall`	of all real positives, how many we caught
Classification	`F1`	balance of precision and recall
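To make the two regression error metrics concrete, here is a small hand calculation. The five actual and predicted values are invented for illustration. MAE averages the absolute misses; RMSE squares them first, so one large miss pulls it up more.

```python
import math

# Made-up actual vs predicted sales for five test-set months
actual    = [100, 120, 130, 150, 170]
predicted = [110, 115, 135, 140, 190]

# Error per month: predicted minus actual -> 10, -5, 5, -10, 20
errors = [p - a for a, p in zip(actual, predicted)]

# MAE: average absolute miss
mae = sum(abs(e) for e in errors) / len(errors)

# RMSE: square, average, then square-root; big misses weigh more
rmse = math.sqrt(sum(e ** 2 for e in errors) / len(errors))

print('MAE :', round(mae, 1))    # 10.0
print('RMSE:', round(rmse, 1))   # 11.4
```

Both are in the same units as the target, so they can be read directly as "we are typically off by about this much".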

A confusion matrix shows exactly where a classifier is right and wrong:

Green = correct predictions; red = the two kinds of mistake (false alarms and misses).
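The four cells of a confusion matrix, and the metrics built from them, can be counted by hand. A minimal sketch with invented labels (1 = churned, 0 = stayed):

```python
# Made-up labels for ten customers
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)   # churners correctly flagged
fn = sum(1 for t, p in pairs if t == 1 and p == 0)   # misses
fp = sum(1 for t, p in pairs if t == 0 and p == 1)   # false alarms
tn = sum(1 for t, p in pairs if t == 0 and p == 0)   # stayers correctly left alone

precision = tp / (tp + fp)   # of those predicted positive, how many were right
recall = tp / (tp + fn)      # of all real positives, how many we caught
f1 = 2 * precision * recall / (precision + recall)

print('TP, FN, FP, TN:', tp, fn, fp, tn)   # 3 1 1 5
print('Precision:', precision)             # 0.75
print('Recall   :', recall)                # 0.75
print('F1       :', f1)                    # 0.75
```

On real data, confusion_matrix, precision_score, recall_score and f1_score in sklearn.metrics return the same figures.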

Accuracy can lie. If only 2% of customers churn, a model that predicts “nobody churns” is 98% accurate — and useless. On imbalanced problems, precision, recall and F1 tell the truth.
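The 2% example is easy to reproduce. The sketch below assumes 1,000 made-up customers and a "model" that always predicts that the customer stays:

```python
# 1,000 customers, 2% of whom churn (1 = churned, 0 = stayed)
y_true = [1] * 20 + [0] * 980
y_pred = [0] * 1000            # always predicts "stays"

correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
accuracy = correct / len(y_true)

# Recall: share of real churners the model actually caught
caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall = caught / sum(y_true)

print('Accuracy:', accuracy)   # 0.98
print('Recall  :', recall)     # 0.0
```

Accuracy looks excellent, yet recall shows that not a single churner was identified.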

Key points

Regression: judge with RMSE/MAE (error size) and R² (variation explained).
Classification: precision (right when positive), recall (caught the positives), F1 (their balance).
On imbalanced data, accuracy is misleading — read the confusion matrix and precision/recall.

5. Feature importance & business translation

A model's real value is telling you which factors matter most — so the business can act on them.

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# random_state fixes tie-breaking so the importances are reproducible
tree = DecisionTreeClassifier(max_depth=4, random_state=42).fit(X_train, y_train)

importance = pd.Series(tree.feature_importances_, index=X.columns)
print(importance.sort_values(ascending=False).round(3))

▶ Output

support_calls     0.512
tenure            0.318
monthly_charges   0.170
dtype: float64

Turn the model into a recommendation

Model output	Business translation
support_calls is the top driver of churn	Customers who contact support a lot are flight risks — trigger a proactive call after the 3rd ticket.
short tenure increases churn risk	Strengthen onboarding in the first 90 days.

The analyst's value-add: anyone can run .fit(). You are paid to translate “feature importance 0.51” into “here is what we should do, and the expected impact.” Always close the loop from model → recommendation.

Key points

Coefficients (linear/logistic) and feature_importances_ (trees) reveal what drives the outcome.
Rank the drivers, then translate each into a concrete business action.
The model's job is to inform a decision — always end with a recommendation.
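Coefficients can be ranked in the same way as tree importances. The sketch below is one possible approach, not the only one: it generates synthetic customers (an assumption for illustration, built so that support calls drive churn most) and standardises the features first, so that coefficient sizes are comparable across columns with different units.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic customers: churn driven mostly by support calls, a little by tenure
rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    'tenure': rng.integers(1, 61, size=n),
    'monthly_charges': rng.uniform(20, 100, size=n),
    'support_calls': rng.poisson(2, size=n),
})
logit = -1 + 1.2 * (X['support_calls'] - 2) - 0.03 * (X['tenure'] - 30)
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# Standardise so one unit means "one standard deviation" for every feature
X_scaled = StandardScaler().fit_transform(X)
model = LogisticRegression(max_iter=1000).fit(X_scaled, y)

coefs = pd.Series(model.coef_[0], index=X.columns)
# Rank by absolute size; the sign shows the direction of the effect
ranked = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
print(ranked.round(2))
```

A positive coefficient pushes the churn probability up and a negative one pulls it down, which is the starting point for the business translation.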

6. Time-series forecasting

For data over time (daily sales, monthly demand), specialised methods capture trend and seasonality.

Start simple: moving average & exponential smoothing

import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

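# assumes 'date' is a datetime column with one row per month
ts = df.set_index('date')['sales']

moving_avg = ts.rolling(window=3).mean()      # smooths short-term noise

# seasonal_periods=12 models a yearly cycle; works best with two or more full years of history
model = ExponentialSmoothing(ts, trend='add', seasonal='add',
                             seasonal_periods=12).fit()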
forecast = model.forecast(3)                  # next 3 months
print(forecast.round(0))

▶ Output

2024-07-31    46100.0
2024-08-31    47550.0
2024-09-30    49200.0
Freq: M, Name: sales, dtype: float64

Solid teal = actuals; dashed orange = the model's forecast for the coming months.

Decomposition (seasonal_decompose) splits a series into trend, seasonality and residual — a great first look that explains why a series moves the way it does.

Simple, explainable, monitored. A modest forecast you understand and re-check each month beats a complex one nobody trusts. Always show a forecast with its uncertainty, and compare predictions to what actually happens.
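One rough way to attach uncertainty to a forecast is to look at how far past forecasts missed. The sketch below uses made-up forecasts and actuals and a simple band of plus or minus two standard deviations of those misses. This is a heuristic assumed here for illustration (it presumes errors are roughly symmetric and stable over time), not the interval a statistical library would compute.

```python
import statistics

# Made-up history: what was forecast vs what actually happened (monthly sales)
past_forecast = [41000, 42500, 43000, 44800, 45500, 46000]
past_actual   = [40200, 43100, 42600, 45900, 44700, 46400]

# Error for each past month: actual minus forecast
errors = [a - f for a, f in zip(past_actual, past_forecast)]
spread = statistics.stdev(errors)      # typical size of a miss

next_forecast = 46100                  # hypothetical point forecast for next month
low = next_forecast - 2 * spread
high = next_forecast + 2 * spread

print(f'Forecast: {next_forecast}, rough range: {low:.0f} to {high:.0f}')
```

Re-running this each month with the newest actuals is also a simple way to monitor whether the forecast is drifting.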

Key points

Time-series methods capture trend and seasonality that plain regression misses.
Moving averages smooth noise; exponential smoothing/Holt-Winters forecast forward.
Decomposition explains a series as trend + seasonality + residual; always show forecast uncertainty.

★ Hands-on Project — Sales Forecasting Model

Build a regression-based sales forecast for a retail business, evaluate it honestly, and deliver a plain-English business report.

Load historical sales with at least one driver (e.g. ad spend, month, promotions).
Split the data into training and test sets with train_test_split.
Fit a LinearRegression model and report the slope(s), intercept and R-squared.
Evaluate on the test set with RMSE/MAE and interpret the typical error size in business terms.
Forecast the next period and show a confidence range (not just a single number).
Identify which feature matters most and translate it into one concrete recommendation.
Bonus: build a time-series forecast (exponential smoothing) and plot history + forecast.
Write a one-page report (forecast, accuracy, assumptions, recommendation) and push it to GitHub.

Ready to test yourself?

Take the module quiz. Score 70% or more to mark this module complete.
