Deep learning is behind the breakthroughs you have heard about — image recognition, voice assistants, the models that power chatbots. At its core it is surprisingly simple: stacks of the dot product from Module 1, separated by little nonlinear functions, trained by the gradient descent from Module 1. That is genuinely most of it. This module builds your intuition from a single neuron up to convolutional networks for images, then shows the professional shortcut — transfer learning — that lets you stand on the shoulders of models trained on millions of images. We keep the maths light and the code runnable.
1. From a neuron to a neural network
A single neuron does three things: it multiplies each input by a weight, adds up the products (plus a bias), and passes the result through an activation function. That is just a dot product followed by a nonlinear step.
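To make those three steps concrete, here is a minimal sketch of one neuron in plain Python. The function names (`relu`, `neuron`) and the input, weight and bias values are illustrative assumptions, not taken from any trained model.

```python
def relu(z):
    """ReLU activation: pass positives through, clamp negatives to zero."""
    return max(0.0, z)


def neuron(inputs, weights, bias, activation=relu):
    """Weighted sum of the inputs plus a bias, followed by an activation."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias  # dot product + bias
    return activation(z)


# Illustrative values only (hypothetical, not learned).
inputs = [1.0, 2.0, 3.0]
weights = [0.5, -1.0, 0.25]
bias = 1.0

output = neuron(inputs, weights, bias)
print(output)  # weighted sum is -0.75, plus bias 1.0 gives 0.25; ReLU keeps it
```

Swapping `activation` for another function changes only the last step; the dot product and bias stay the same.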
Activation functions add the nonlinearity
Without a nonlinear activation, stacking layers would just be one big linear model. ReLU (max(0, x)) is the modern default for hidden layers; sigmoid/softmax turn outputs into probabilities.
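The collapse claim can be checked by hand. The sketch below, with hand-picked weights and biases left out for brevity (both are assumptions made for illustration), shows that two linear layers give the same answer as one merged layer, and that putting ReLU between them breaks that equivalence.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def matmul(A, B):
    """Multiply matrix A by matrix B (both lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def relu_vec(v):
    """Apply ReLU to every element of a vector."""
    return [max(0.0, z) for z in v]


# Hypothetical weights for a 2 -> 2 -> 1 network (biases omitted).
W1 = [[1.0, -1.0], [-1.0, 1.0]]
W2 = [[1.0, 1.0]]
x = [2.0, 1.0]

# Two linear layers with no activation in between...
two_linear_layers = matvec(W2, matvec(W1, x))
# ...match ONE linear layer whose weights are the product W2 * W1.
one_merged_layer = matvec(matmul(W2, W1), x)

# With ReLU in between, the result differs from the merged linear layer.
with_relu = matvec(W2, relu_vec(matvec(W1, x)))

print(two_linear_layers, one_merged_layer, with_relu)  # [0.0] [0.0] [1.0]
```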
- A neuron computes a weighted sum (a dot product) plus bias, then a nonlinear activation.
- Activations like ReLU add nonlinearity; without them, deep layers collapse to one linear model.
- A neural network is layers of neurons stacked; 'deep' means several hidden layers.
2. How networks learn: forward pass & backpropagation
Training is a loop of two passes. The forward pass runs inputs through the network to a prediction and computes the loss. Backpropagation then uses calculus (the chain rule) to compute how much each weight contributed to the error, and gradient descent nudges every weight to reduce it.
# The training loop, in plain pseudocode
for epoch in range(num_epochs):
    for batch in data:
        preds = model(batch.x)            # forward pass
        loss = loss_fn(preds, batch.y)    # how wrong?
        loss.backward()                   # backprop: compute gradients
        optimizer.step()                  # gradient descent: update weights
        optimizer.zero_grad()             # reset for next batch

- Forward pass: inputs → prediction → loss. Backward pass: gradients of the loss w.r.t. every weight.
- Backpropagation is the chain rule done efficiently; the update is gradient descent at scale.
- Frameworks (PyTorch/TensorFlow) compute gradients automatically via autodiff.
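To see what autodiff is doing for you, here is a minimal hand-written sketch of one forward pass, one backward pass and one gradient-descent step for a single ReLU neuron with a squared-error loss. The starting values, the learning rate and the choice of loss are assumptions picked for illustration; a finite-difference estimate is included as a sanity check on the chain-rule gradient.

```python
def forward(w, b, x, y):
    """Forward pass: input -> prediction -> loss (squared error)."""
    z = w * x + b
    a = max(0.0, z)              # ReLU activation
    loss = (a - y) ** 2
    return z, a, loss


def backward(w, b, x, y):
    """Backward pass: multiply local derivatives along the chain."""
    z, a, _ = forward(w, b, x, y)
    dloss_da = 2.0 * (a - y)             # d(loss)/d(a)
    da_dz = 1.0 if z > 0 else 0.0        # derivative of ReLU
    dloss_dz = dloss_da * da_dz
    return dloss_dz * x, dloss_dz * 1.0  # dz/dw = x, dz/db = 1


# Hypothetical starting point and training example.
w, b, x, y = 0.5, 0.1, 2.0, 3.0
learning_rate = 0.05

_, _, loss_before = forward(w, b, x, y)
grad_w, grad_b = backward(w, b, x, y)

# Sanity check: estimate d(loss)/d(w) numerically with a tiny nudge.
eps = 1e-6
numeric_grad_w = (forward(w + eps, b, x, y)[2]
                  - forward(w - eps, b, x, y)[2]) / (2 * eps)

# Gradient descent: step against the gradient.
w_new = w - learning_rate * grad_w
b_new = b - learning_rate * grad_b
_, _, loss_after = forward(w_new, b_new, x, y)

print(f"analytic {grad_w:.4f} vs numeric {numeric_grad_w:.4f}")
print(f"loss before {loss_before:.4f}, after {loss_after:.4f}")
```

A framework does the same bookkeeping for millions of weights at once, which is why you never write `backward` by hand in practice.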
3. Build your first network with Keras
Keras (inside TensorFlow) is the friendliest way to build networks: stack layers, compile, fit. Let us classify handwritten digits (MNIST) — the “hello world” of deep learning.
Define, compile, train
import tensorflow as tf
from tensorflow import keras

(X_tr, y_tr), (X_te, y_te) = keras.datasets.mnist.load_data()
X_tr, X_te = X_tr / 255.0, X_te / 255.0  # scale pixels to 0-1

model = keras.Sequential([
    keras.Input(shape=(28, 28)),                    # one 28x28 grayscale image
    keras.layers.Flatten(),                         # 28x28 -> 784
    keras.layers.Dense(128, activation='relu'),
    keras.layers.Dropout(0.2),
    keras.layers.Dense(10, activation='softmax'),   # 10 digit classes
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(X_tr, y_tr, epochs=5, validation_split=0.1)

Epoch 1/5 loss: 0.2934 accuracy: 0.9145 val_accuracy: 0.9598
Epoch 2/5 loss: 0.1421 accuracy: 0.9578 val_accuracy: 0.9680
Epoch 3/5 loss: 0.1064 accuracy: 0.9676 val_accuracy: 0.9722
Epoch 4/5 loss: 0.0875 accuracy: 0.9728 val_accuracy: 0.9745
Epoch 5/5 loss: 0.0739 accuracy: 0.9772 val_accuracy: 0.9763
Evaluate on the test set
test_loss, test_acc = model.evaluate(X_te, y_te, verbose=0)
print(f'Test accuracy: {test_acc:.4f}')

Test accuracy: 0.9771
About 98% accuracy on unseen handwritten digits, from a model you can read top to bottom. Watch val_accuracy during training — if it stops improving while training accuracy keeps climbing, you are overfitting.
- Keras workflow: Sequential stack of layers → compile (optimizer + loss) → fit.
- Scale inputs (e.g. pixels to 0-1); use softmax + cross-entropy for multi-class output.
- Track validation accuracy to catch overfitting as training proceeds.
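The softmax + cross-entropy pairing is easier to trust once you have computed it by hand. Below is a minimal sketch in plain Python of what the output layer and a sparse categorical cross-entropy loss compute for one example. The three logits are made-up numbers, and real frameworks add numerical safeguards (such as clipping) that this sketch leaves out.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)                              # subtract the max for stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_categorical_crossentropy(probs, true_class):
    """Loss is the negative log of the probability given to the true class."""
    return -math.log(probs[true_class])


# Hypothetical raw scores for a 3-class problem.
logits = [2.0, 1.0, 0.1]
probs = softmax(logits)

loss_if_class_0 = sparse_categorical_crossentropy(probs, 0)  # model agrees
loss_if_class_2 = sparse_categorical_crossentropy(probs, 2)  # model disagrees

print([round(p, 3) for p in probs])
print(round(loss_if_class_0, 3), round(loss_if_class_2, 3))
```

The loss is small when the true class gets high probability and grows quickly when it does not, which is exactly the signal gradient descent needs.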
4. Training tricks that make networks work
Deep networks are powerful but finicky. A handful of techniques turn a model that will not train into one that works.
| Technique | Problem it solves |
|---|---|
| ReLU activations | vanishing gradients in deep nets |
| Adam optimiser | a single fixed learning rate that suits some weights but not others (adapts it per weight) |
| Dropout | overfitting (randomly mutes neurons) |
| Batch normalisation | unstable, slow training |
| Early stopping | training too long / overfitting |
| Learning-rate schedules | getting stuck or overshooting |
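To make the dropout row concrete, here is a minimal sketch of "inverted" dropout in plain Python: during training each value is zeroed with probability `rate` and the survivors are scaled up by 1 / (1 - rate), while at inference the values pass through unchanged. The helper name `dropout`, the seed and the activations are assumptions for illustration.

```python
import random


def dropout(values, rate, training, rng):
    """Randomly mute values during training; do nothing at inference."""
    if not training:
        return list(values)
    keep = 1.0 - rate
    # Survivors are scaled by 1/keep so the expected total stays the same.
    return [v / keep if rng.random() >= rate else 0.0 for v in values]


# Hypothetical layer activations.
activations = [1.0] * 10
rng = random.Random(0)  # fixed seed so the run is repeatable

train_out = dropout(activations, rate=0.2, training=True, rng=rng)
eval_out = dropout(activations, rate=0.2, training=False, rng=rng)

print(train_out)  # a mix of 0.0 (muted) and 1.25 (kept and rescaled)
print(eval_out)   # unchanged
```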
Callbacks: stop early, keep the best
from tensorflow import keras

early = keras.callbacks.EarlyStopping(
    monitor='val_loss', patience=3, restore_best_weights=True)
model.fit(X_tr, y_tr, epochs=50, validation_split=0.1,
          callbacks=[early])  # stops automatically when val_loss plateaus

- ReLU + Adam are sensible defaults; dropout and batch norm stabilise and regularise.
- Early stopping halts training when validation loss plateaus and restores the best weights.
- Epochs, batch size and learning rate are the core dials — watch the train/validation gap.
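The patience rule behind early stopping is simple enough to sketch in a few lines. The function below is a simplified stand-in, not the Keras implementation, and the validation losses are invented for illustration: it stops once the loss has failed to improve for `patience` epochs in a row and reports which epoch held the best weights.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 0-indexed."""
    best_loss = float("inf")
    best_epoch = -1
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch


# Hypothetical validation losses, one per epoch.
val_losses = [0.50, 0.40, 0.35, 0.36, 0.37, 0.38, 0.30]

stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)
print(stop_epoch, best_epoch)  # 5 2
```

Training halts at epoch 5 and the weights from epoch 2 would be restored. The later dip to 0.30 is never reached, which is the trade-off a small patience makes.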
5. Convolutional Neural Networks for images
For images, a plain dense network ignores spatial structure. Convolutional Neural Networks (CNNs) slide small filters across the image to detect edges, then textures, then shapes — building understanding hierarchically.
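Before the Keras version, here is a hand-rolled sketch of what "sliding a small filter" means. The toy 4x4 image and the hand-picked vertical-edge filter are assumptions for illustration; in a real CNN the filter values are learned during training.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1), summing products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Elementwise multiply the filter with the patch under it, then sum.
            row.append(sum(image[i + a][j + b] * kernel[a][b]
                           for a in range(kh) for b in range(kw)))
        out.append(row)
    return out


# Toy image: dark (0) on the left half, bright (1) on the right half.
image = [[0, 0, 1, 1] for _ in range(4)]

# Hand-picked filter that responds to a dark-to-bright vertical edge.
kernel = [[-1, 1],
          [-1, 1]]

feature_map = conv2d_valid(image, kernel)
for row in feature_map:
    print(row)  # each row is [0, 2, 0]: the filter fires only at the edge
```

The same small filter is reused at every position, so it finds the edge wherever it sits in the image.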
from tensorflow import keras

cnn = keras.Sequential([
    keras.Input(shape=(28, 28, 1)),                 # height, width, 1 grayscale channel
    keras.layers.Conv2D(32, 3, activation='relu'),
    keras.layers.MaxPooling2D(),
    keras.layers.Conv2D(64, 3, activation='relu'),
    keras.layers.MaxPooling2D(),
    keras.layers.Flatten(),
    keras.layers.Dense(64, activation='relu'),
    keras.layers.Dense(10, activation='softmax'),
])
cnn.compile(optimizer='adam',
            loss='sparse_categorical_crossentropy', metrics=['accuracy'])
# MNIST arrays are (n, 28, 28); add a channel axis first, e.g. X_tr[..., None]
# cnn.fit(...) reaches ~99% on MNIST

Test accuracy: 0.9912
- CNNs slide small filters over images to detect features, preserving spatial structure.
- Conv + pooling layers shrink the spatial size and deepen features toward a classifier.
- Early layers learn edges, deeper layers learn whole objects — a feature hierarchy.
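To see how conv and pooling layers shrink the spatial size, the arithmetic for a 28x28 input can be traced by hand. This sketch assumes 3x3 convolutions with no padding and stride 1, and 2x2 max pooling with stride 2 (the Keras defaults for `Conv2D` and `MaxPooling2D`), with 64 filters in the last convolution as in the model above.

```python
def conv_out(size, kernel=3):
    """Output width/height of a conv layer with no padding and stride 1."""
    return size - kernel + 1


def pool_out(size, pool=2):
    """Output width/height of max pooling with window and stride `pool`."""
    return size // pool


shapes = []
size = 28
for layer in (conv_out, pool_out, conv_out, pool_out):
    size = layer(size)
    shapes.append(size)

last_conv_filters = 64
flattened = size * size * last_conv_filters

print(shapes)     # [26, 13, 11, 5]
print(flattened)  # 1600 values feed the first Dense layer
```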
6. Transfer learning: stand on giants' shoulders
Training a strong image model from scratch needs millions of images and serious compute. Transfer learning reuses a network already trained on a huge dataset (like ImageNet) and adapts it to your task with a tiny amount of data — the single most practical deep-learning technique.
Reuse a pretrained backbone
from tensorflow import keras

# 1. Load a pretrained model, drop its classifier head, freeze its weights
base = keras.applications.MobileNetV2(
    input_shape=(160, 160, 3), include_top=False, weights='imagenet')
base.trainable = False

# 2. Add a small head for YOUR classes (e.g. cats vs dogs)
model = keras.Sequential([
    base,
    keras.layers.GlobalAveragePooling2D(),
    keras.layers.Dropout(0.2),
    keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=['accuracy'])
# Note: MobileNetV2 expects pixels scaled to [-1, 1];
# keras.applications.mobilenet_v2.preprocess_input does this.
# A few epochs reach strong accuracy on a few thousand images

Epoch 1/3 accuracy: 0.9123 val_accuracy: 0.9620
Epoch 2/3 accuracy: 0.9588 val_accuracy: 0.9740
Epoch 3/3 accuracy: 0.9701 val_accuracy: 0.9780
The frozen backbone already knows edges, textures and shapes from a million photos; you only teach the small head to map those features to your labels. Fine-tuning — unfreezing the top layers later at a tiny learning rate — can squeeze out a little more.
- Transfer learning adapts a model pretrained on huge data to your task with little data.
- Freeze the pretrained backbone, train a small new head, then optionally fine-tune top layers.
- It is the default for image (and text) tasks — far cheaper than training from scratch.
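A quick back-of-the-envelope count shows why training only the head is so cheap. The sketch assumes the frozen backbone hands 1280 features per image to the head after global average pooling (MobileNetV2's feature width at its default settings; treat the figure as an assumption if you swap backbones) and that the head is a single sigmoid unit as in the model above.

```python
# Assumption: feature width of the frozen backbone after global average pooling.
backbone_features = 1280
head_units = 1  # one sigmoid output for a two-class problem

# A Dense layer has one weight per (input, unit) pair plus one bias per unit.
head_weights = backbone_features * head_units
head_biases = head_units
trainable_head_params = head_weights + head_biases

# Pooling and dropout layers have no weights, so they add nothing here.
print(trainable_head_params)  # 1281
```

With the backbone frozen, those are the only numbers gradient descent has to adjust, compared with the millions of weights that stay fixed inside the backbone.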
★ Hands-on Project — Image Classifier with Transfer Learning
Train a real image classifier two ways and compare — building intuition for when deep learning earns its keep.
- Start in Google Colab with a free GPU (Runtime → Change runtime type → GPU).
- Load a small image dataset (e.g. cats vs dogs, or Fashion-MNIST) and scale/augment the images.
- Baseline: build and train a simple CNN from scratch; record test accuracy and training time.
- Transfer learning: load a pretrained backbone (MobileNetV2 or EfficientNet), freeze it, add a small head, and train.
- Compare the two on accuracy and training time — note how much less data/compute transfer learning needs.
- Add dropout and EarlyStopping; plot training vs validation accuracy and diagnose any overfitting.
- Fine-tune: unfreeze the top few backbone layers at a low learning rate and measure the change.
- Show a few predictions (with the image) including at least one mistake, then commit the notebook to your portfolio.