🧠 Module 7

Deep Learning

⏱ 18 hours · Advanced · 6 topics

🎯 By the end: explain how a neural network computes and learns, build and train networks with Keras, apply the training tricks that make them work, construct a CNN for image classification, and use transfer learning to get strong results with little data.

Deep learning is behind the breakthroughs you have heard about — image recognition, voice assistants, the models that power chatbots. At its core it is surprisingly simple: stacks of the dot product from Module 1, separated by little nonlinear functions, trained by the gradient descent from Module 1. That is genuinely most of it. This module builds your intuition from a single neuron up to convolutional networks for images, then shows the professional shortcut — transfer learning — that lets you stand on the shoulders of models trained on millions of images. We keep the maths light and the code runnable.

1. From a neuron to a neural network

A single neuron does three things: multiply each input by a weight, add them up (plus a bias), and pass the result through an activation function. That is just a dot product followed by a squashing step.

One neuron: weighted sum of inputs (+ bias) passed through a nonlinear activation.
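The three steps of a neuron fit in a few lines of NumPy. This is a minimal sketch: the input, weight and bias values are made up for illustration, and sigmoid is used as the activation only because it is easy to check by hand.

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    """Weighted sum of inputs plus bias, then a nonlinear activation."""
    z = np.dot(w, x) + b      # the dot product from Module 1, plus the bias
    return sigmoid(z)         # the squashing step

# Illustrative values, not taken from any real model
x = np.array([1.0, 2.0, 3.0])      # inputs
w = np.array([0.5, -0.25, 0.0])    # one weight per input
b = 0.0                            # bias

output = neuron(x, w, b)   # z = 0.5 - 0.5 + 0.0 = 0.0, and sigmoid(0) = 0.5
print(output)
```

A Dense layer is many of these neurons sharing the same inputs, which is why it can be computed as one matrix multiplication.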

Activation functions add the nonlinearity

Without a nonlinear activation, stacking layers would just be one big linear model. ReLU (max(0, x)) is the modern default for hidden layers; at the output, sigmoid (for two classes) or softmax (for several) turns raw scores into probabilities.

ReLU passes positives unchanged and zeros negatives; sigmoid squashes any value into (0, 1).
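Both activations are one-liners. The sketch below applies them to a small made-up vector so you can see the behaviour described above.

```python
import numpy as np

def relu(x):
    """Pass positives through unchanged, replace negatives with zero."""
    return np.maximum(0.0, x)

def sigmoid(x):
    """Squash any value into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

z = np.array([-2.0, 0.0, 3.0])   # illustrative pre-activation values

relu_out = relu(z)               # negatives become 0, positives are kept
sigmoid_out = sigmoid(z)         # every value lands between 0 and 1

print(relu_out)
print(sigmoid_out)
```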

A network is layers of neurons. Connect many neurons into layers, stack the layers, and you have a neural network. “Deep” just means several hidden layers. Each successive layer learns more abstract features of the input than the one before it.
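You can check the "one big linear model" claim directly. The sketch below stacks two tiny layers with hand-picked weights (biases are left out to keep it short) and shows that they collapse into a single matrix unless a ReLU sits between them.

```python
import numpy as np

# Hand-picked illustrative weights; biases omitted for brevity
x  = np.array([1.0, 2.0])            # one input with 2 features
W1 = np.array([[ 1.0, -1.0],
               [-1.0,  1.0]])        # hidden layer: 2 inputs -> 2 neurons
W2 = np.array([[1.0, 1.0]])          # output layer: 2 neurons -> 1 output

# Two layers with no activation in between...
stacked = W2 @ (W1 @ x)
# ...equal ONE layer whose weight matrix is W2 @ W1
collapsed = (W2 @ W1) @ x
same_without_activation = np.allclose(stacked, collapsed)

# Put a ReLU between the layers and the shortcut no longer works
with_relu = W2 @ np.maximum(0.0, W1 @ x)
same_with_relu = np.allclose(with_relu, collapsed)

print(same_without_activation, same_with_relu)
```

Here the hidden layer produces one negative value, which ReLU zeroes, so the two-layer network computes something no single linear layer with weights W2 @ W1 can.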

Key points

A neuron computes a weighted sum (a dot product) plus bias, then a nonlinear activation.
Activations like ReLU add nonlinearity; without them, deep layers collapse to one linear model.
A neural network is layers of neurons stacked; 'deep' means several hidden layers.

2. How networks learn: forward pass & backpropagation

Training is a loop of two passes. The forward pass runs inputs through the network to a prediction and computes the loss. Backpropagation then uses calculus (the chain rule) to compute how much each weight contributed to the error, and gradient descent nudges every weight to reduce it.

Predict forward, measure the loss, send gradients backward, update weights — repeat.

# The training loop, in PyTorch-style pseudocode
# (model, loss_fn, optimizer and data are placeholders, not defined here)
for epoch in range(num_epochs):
    for batch in data:
        preds = model(batch.x)            # forward pass
        loss  = loss_fn(preds, batch.y)   # how wrong?
        loss.backward()                   # backprop: compute gradients
        optimizer.step()                  # gradient descent: update weights
        optimizer.zero_grad()             # reset for next batch

You already understand this. Backprop is just the chain rule applied efficiently, and the weight update is the gradient descent from Module 1 — now applied to millions of weights at once. Frameworks compute all the gradients for you automatically (“autodiff”).
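To see the loop without a framework, here is a sketch that does the backward pass by hand for the smallest possible model, a single weight and bias fitted to toy data. The data, learning rate and step count are illustrative assumptions; a real network repeats the same chain-rule step layer by layer.

```python
import numpy as np

# Toy data generated from y = 2x + 1 (illustrative)
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0

def forward(w, b):
    preds = w * x + b                    # forward pass
    loss = np.mean((preds - y) ** 2)     # mean squared error
    return preds, loss

def backward(w, b):
    preds, _ = forward(w, b)
    dloss_dpreds = 2.0 * (preds - y) / len(x)   # d(loss)/d(preds)
    grad_w = np.sum(dloss_dpreds * x)           # chain rule: d(preds)/dw = x
    grad_b = np.sum(dloss_dpreds)               # chain rule: d(preds)/db = 1
    return grad_w, grad_b

# Sanity check: the chain-rule gradient matches a finite-difference estimate
eps = 1e-6
grad_w, grad_b = backward(0.0, 0.0)
numeric_w = (forward(eps, 0.0)[1] - forward(-eps, 0.0)[1]) / (2 * eps)

# Gradient descent: repeat forward, backward, update
w, b, lr = 0.0, 0.0, 0.05
for step in range(2000):
    gw, gb = backward(w, b)
    w -= lr * gw
    b -= lr * gb

final_loss = forward(w, b)[1]
print(w, b, final_loss)   # w and b should end up very close to 2 and 1
```

Autodiff does exactly what backward() does here, except that it derives the gradient code for you from the forward pass.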

Key points

Forward pass: inputs → prediction → loss. Backward pass: gradients of the loss w.r.t. every weight.
Backpropagation is the chain rule done efficiently; the update is gradient descent at scale.
Frameworks (PyTorch/TensorFlow) compute gradients automatically via autodiff.

3. Build your first network with Keras

Keras (inside TensorFlow) is the friendliest way to build networks: stack layers, compile, fit. Let us classify handwritten digits (MNIST) — the “hello world” of deep learning.

Define, compile, train

import tensorflow as tf
from tensorflow import keras

(X_tr, y_tr), (X_te, y_te) = keras.datasets.mnist.load_data()
X_tr, X_te = X_tr / 255.0, X_te / 255.0      # scale pixels to 0-1

model = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),   # 28x28 -> 784
    keras.layers.Dense(128, activation='relu'),
    keras.layers.Dropout(0.2),
    keras.layers.Dense(10, activation='softmax'),  # 10 digit classes
])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(X_tr, y_tr, epochs=5, validation_split=0.1)

▶ Output (training)

Epoch 1/5  loss: 0.2934  accuracy: 0.9145  val_accuracy: 0.9598
Epoch 2/5  loss: 0.1421  accuracy: 0.9578  val_accuracy: 0.9680
Epoch 3/5  loss: 0.1064  accuracy: 0.9676  val_accuracy: 0.9722
Epoch 4/5  loss: 0.0875  accuracy: 0.9728  val_accuracy: 0.9745
Epoch 5/5  loss: 0.0739  accuracy: 0.9772  val_accuracy: 0.9763

Evaluate on the test set

test_loss, test_acc = model.evaluate(X_te, y_te, verbose=0)
print(f'Test accuracy: {test_acc:.4f}')

▶ Output

Test accuracy: 0.9771

About 98% accuracy on unseen handwritten digits, from a model you can read top to bottom. Watch val_accuracy during training — if it stops improving while training accuracy keeps climbing, you are overfitting.
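It helps to know what the softmax output layer and the sparse_categorical_crossentropy loss in the model above are computing. The NumPy sketch below shows the idea for a single example with three classes; the scores are made up, and Keras's own implementation additionally handles batches and numerical edge cases.

```python
import numpy as np

def softmax(logits):
    """Turn a vector of raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)    # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / np.sum(exps)

def sparse_cross_entropy(probs, true_class):
    """Negative log of the probability given to the correct class.

    'Sparse' means the label is an integer class index, not a one-hot vector.
    """
    return -np.log(probs[true_class])

logits = np.array([2.0, 1.0, 0.1])       # illustrative raw scores for 3 classes
probs = softmax(logits)

loss_if_class_0 = sparse_cross_entropy(probs, 0)   # model favoured the right class
loss_if_class_2 = sparse_cross_entropy(probs, 2)   # model favoured the wrong class

print(probs, loss_if_class_0, loss_if_class_2)
```

The loss is small when the correct class gets a high probability and grows quickly as that probability approaches zero, which is what pushes the network toward confident, correct predictions.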

Key points

Keras workflow: Sequential stack of layers → compile (optimizer + loss) → fit.
Scale inputs (e.g. pixels to 0-1); use softmax + cross-entropy for multi-class output.
Track validation accuracy to catch overfitting as training proceeds.

4. Training tricks that make networks work

Deep networks are powerful but finicky. A handful of techniques turn a model that will not train into one that works.

Technique	Problem it solves
ReLU activations	vanishing gradients in deep nets
Adam optimiser	hand-tuning one learning rate (adapts the step size per weight)
Dropout	overfitting (randomly mutes neurons)
Batch normalisation	unstable, slow training
Early stopping	training too long / overfitting
Learning-rate schedules	getting stuck or overshooting

Callbacks: stop early, keep the best

from tensorflow import keras

early = keras.callbacks.EarlyStopping(
    monitor='val_loss', patience=3, restore_best_weights=True)

model.fit(X_tr, y_tr, epochs=50, validation_split=0.1,
          callbacks=[early])      # stops automatically when val_loss plateaus
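The patience rule is simple enough to write out. The function below is a simplified sketch of the idea, not Keras's implementation (it ignores options such as min_delta), and the validation losses are invented for the example.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (stop_epoch, best_epoch), both counted from 0.

    Training stops once `patience` epochs in a row fail to beat the
    best validation loss seen so far.
    """
    best_loss = float("inf")
    best_epoch = -1
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch      # stop here, restore best_epoch's weights
    return len(val_losses) - 1, best_epoch    # never triggered: every epoch ran

# Made-up validation losses: improve for three epochs, then drift upward
val_losses = [0.50, 0.40, 0.35, 0.36, 0.37, 0.38, 0.39]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)

print(stop_epoch, best_epoch)
```

With restore_best_weights=True the model you end up with is the one from best_epoch, not the slightly worse one from the epoch where training stopped.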

Epochs, batches & learning rate — the three dials. An epoch is one full pass over the data; a batch is the chunk processed before each weight update; the learning rate sets the step size. These three, plus the tricks above, are what you spend tuning time on.
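A quick piece of arithmetic ties the dials together. Assuming the MNIST run from topic 3 (60,000 training images, validation_split=0.1, 5 epochs) and Keras's default batch size of 32, the number of weight updates works out as follows.

```python
import math

n_samples = 60_000          # MNIST training images
validation_split = 0.1      # as in the fit() call from topic 3
batch_size = 32             # Keras's default when fit() is not given batch_size
epochs = 5

n_train = round(n_samples * (1 - validation_split))   # images actually trained on
steps_per_epoch = math.ceil(n_train / batch_size)     # weight updates per epoch
total_updates = steps_per_epoch * epochs

print(n_train, steps_per_epoch, total_updates)
```

Doubling the batch size halves the number of updates per epoch, which is one reason batch size and learning rate are usually tuned together.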

Watch the gap, not just the loss. If training loss keeps falling but validation loss starts rising, the network is memorising. Dropout, early stopping and more data are your first responses — exactly the overfitting cures from Module 5, applied to deep nets.
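Dropout, the first of those responses, is easy to sketch in NumPy. This is the "inverted dropout" variant, which rescales the surviving activations so their expected value stays the same; the activations and the random seed are illustrative.

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero a random fraction `rate`, rescale the survivors."""
    if not training or rate == 0.0:
        return activations                               # at inference, dropout does nothing
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob     # True where a neuron survives
    return activations * mask / keep_prob                # rescale to keep the expected value

rng = np.random.default_rng(0)
acts = np.ones(10_000)                  # illustrative activations, all 1.0
dropped = dropout(acts, rate=0.2, rng=rng)

fraction_zeroed = np.mean(dropped == 0.0)   # close to 0.2
mean_after = dropped.mean()                 # close to 1.0

print(fraction_zeroed, mean_after)
```

Because a different random set of neurons is muted on every batch, the network cannot rely on any single neuron, which is what makes memorising the training set harder.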

Key points

ReLU + Adam are sensible defaults; dropout and batch norm stabilise and regularise.
Early stopping halts training when validation loss plateaus and restores the best weights.
Epochs, batch size and learning rate are the core dials — watch the train/validation gap.

5. Convolutional Neural Networks for images

For images, a plain dense network ignores spatial structure. Convolutional Neural Networks (CNNs) slide small filters across the image to detect edges, then textures, then shapes — building understanding hierarchically.

Convolution + pooling shrink the spatial size while deepening features, ending in a classifier.
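Sliding a filter is just a dot product repeated on small windows. The sketch below runs a hand-picked vertical-edge filter over a toy 6x6 image and then applies 2x2 max pooling; a real CNN learns its filter values during training rather than having them chosen like this.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` with no padding and stride 1."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)   # a dot product on a small window
    return out

def max_pool_2x2(fmap):
    """Keep the largest value in each non-overlapping 2x2 block."""
    h, w = fmap.shape[0] // 2, fmap.shape[1] // 2
    return fmap[:h * 2, :w * 2].reshape(h, 2, w, 2).max(axis=(1, 3))

# 6x6 toy image: dark left half (0), bright right half (1)
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# Hand-picked vertical-edge filter (illustrative)
kernel = np.array([[-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0]])

feature_map = conv2d_valid(image, kernel)   # 4x4: responds only near the edge
pooled = max_pool_2x2(feature_map)          # 2x2: smaller, strongest responses kept

print(feature_map)
print(pooled)
```

The feature map is zero over the flat regions and large where dark meets bright, so the filter has turned raw pixels into an "is there a vertical edge here?" signal.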

from tensorflow import keras

cnn = keras.Sequential([
    keras.layers.Conv2D(32, 3, activation='relu', input_shape=(28, 28, 1)),
    keras.layers.MaxPooling2D(),
    keras.layers.Conv2D(64, 3, activation='relu'),
    keras.layers.MaxPooling2D(),
    keras.layers.Flatten(),
    keras.layers.Dense(64, activation='relu'),
    keras.layers.Dense(10, activation='softmax'),
])
cnn.compile(optimizer='adam',
            loss='sparse_categorical_crossentropy', metrics=['accuracy'])
# MNIST arrays have shape (N, 28, 28); add a channel axis before fitting:
# cnn.fit(X_tr[..., None], y_tr, ...) reaches ~99% on MNIST

▶ Output (after training)

Test accuracy: 0.9912
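You can trace how the CNN above shrinks the image and where its weights live with plain arithmetic. The sketch assumes the Keras defaults used in that model: 'valid' padding and stride 1 for Conv2D, and a 2x2 pool for MaxPooling2D.

```python
def conv_out(size, kernel=3):
    """Spatial size after a 'valid' convolution with stride 1."""
    return size - kernel + 1

def pool_out(size, pool=2):
    """Spatial size after non-overlapping max pooling (remainder dropped)."""
    return size // pool

size = 28
size = conv_out(size)     # Conv2D(32, 3)   -> 26
size = pool_out(size)     # MaxPooling2D()  -> 13
size = conv_out(size)     # Conv2D(64, 3)   -> 11
size = pool_out(size)     # MaxPooling2D()  -> 5
flat = size * size * 64   # Flatten()       -> 5 * 5 * 64 values

params = {
    "conv1":  3 * 3 * 1 * 32 + 32,    # 3x3 kernel x 1 input channel x 32 filters, plus biases
    "conv2":  3 * 3 * 32 * 64 + 64,   # 3x3 kernel x 32 input channels x 64 filters, plus biases
    "dense1": flat * 64 + 64,
    "dense2": 64 * 10 + 10,
}
total_params = sum(params.values())

print(size, flat, params, total_params)
```

Most of the weights sit in the first Dense layer, not in the convolutions: a filter is reused at every position of the image, so convolutional layers stay small.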

CNNs see hierarchy. Early filters learn edges; middle layers learn parts (an eye, a wheel); deep layers learn whole objects. This is why a CNN beats a dense net on images — it respects that nearby pixels belong together.

Key points

CNNs slide small filters over images to detect features, preserving spatial structure.
Conv + pooling layers shrink the spatial size and deepen features toward a classifier.
Early layers learn edges, deeper layers learn whole objects — a feature hierarchy.

6. Transfer learning: stand on giants' shoulders

Training a strong image model from scratch needs millions of images and serious compute. Transfer learning reuses a network already trained on a huge dataset (like ImageNet) and adapts it to your task with a tiny amount of data — the single most practical deep-learning technique.

Reuse a pretrained backbone

from tensorflow import keras

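# 1. Load a pretrained model, drop its classifier head, freeze its weights
#    (MobileNetV2 expects pixels scaled to [-1, 1], e.g. with
#    keras.applications.mobilenet_v2.preprocess_input)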
base = keras.applications.MobileNetV2(
    input_shape=(160, 160, 3), include_top=False, weights='imagenet')
base.trainable = False

# 2. Add a small head for YOUR classes (e.g. cats vs dogs)
model = keras.Sequential([
    base,
    keras.layers.GlobalAveragePooling2D(),
    keras.layers.Dropout(0.2),
    keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=['accuracy'])
# A few epochs of training reach strong accuracy on a few thousand images

▶ Output (training the new head)

Epoch 1/3  accuracy: 0.9123  val_accuracy: 0.9620
Epoch 2/3  accuracy: 0.9588  val_accuracy: 0.9740
Epoch 3/3  accuracy: 0.9701  val_accuracy: 0.9780

The frozen backbone already knows edges, textures and shapes from more than a million ImageNet photos; you only teach the small head to map those features to your labels. Fine-tuning (unfreezing the top layers later at a tiny learning rate) can squeeze out a little more.
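A little arithmetic shows just how small "the small head" is. The sketch assumes the standard MobileNetV2 at its default width, whose final feature map has 1,280 channels and is 32 times smaller than the input in each spatial dimension.

```python
input_size = 160
downsample = 32             # MobileNetV2 shrinks height and width 32x overall
feature_channels = 1280     # channels in its final feature map (default width)

spatial = input_size // downsample    # backbone output is (spatial, spatial, 1280)
pooled_features = feature_channels    # GlobalAveragePooling2D averages away the spatial grid

head_params = pooled_features * 1 + 1   # Dense(1): one weight per feature, plus a bias

print(spatial, pooled_features, head_params)
```

So only about 1,300 numbers are being trained, on top of a frozen backbone with over two million. That is why a few thousand images and a few epochs are enough.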

Default to transfer learning for images and text. Modern NLP works the same way (Module 8): start from a pretrained transformer and adapt it. Training from scratch is for research budgets; adapting a pretrained model is for getting results this week.

Key points

Transfer learning adapts a model pretrained on huge data to your task with little data.
Freeze the pretrained backbone, train a small new head, then optionally fine-tune top layers.
It is the default for image (and text) tasks — far cheaper than training from scratch.

★ Hands-on Project — Image Classifier with Transfer Learning

Train a real image classifier two ways and compare — building intuition for when deep learning earns its keep.

Start in Google Colab with a free GPU (Runtime → Change runtime type → GPU).
Load a small image dataset (e.g. cats vs dogs, or Fashion-MNIST) and scale/augment the images.
Baseline: build and train a simple CNN from scratch; record test accuracy and training time.
Transfer learning: load a pretrained backbone (MobileNetV2 or EfficientNet), freeze it, add a small head, and train.
Compare the two on accuracy and training time — note how much less data/compute transfer learning needs.
Add dropout and EarlyStopping; plot training vs validation accuracy and diagnose any overfitting.
Fine-tune: unfreeze the top few backbone layers at a low learning rate and measure the change.
Show a few predictions (with the image) including at least one mistake, then commit the notebook to your portfolio.

Ready to test yourself?

Take the module quiz. Score 70% or more to mark this module complete.
