💬 Module 8

Natural Language Processing

⏱ 14 hours · Advanced · 6 topics

🎯 By the end: clean and tokenise text, turn it into numeric features with TF-IDF and embeddings, train a text classifier, explain how transformers and attention work, and use pretrained Hugging Face models for real NLP tasks.

Most of the world's data is text — reviews, emails, tickets, posts, contracts. Natural Language Processing (NLP) is how we turn that messy language into something a model can use. This module traces the field's whole arc in one sitting: from classic preprocessing and TF-IDF, through word embeddings, to the transformers that power today's language models — and crucially, how to use pretrained models in a few lines with Hugging Face. You will finish able to build a working text classifier and to apply state-of-the-art models without a research budget.

1. Text preprocessing & linguistic features

Raw text must be broken into pieces and normalised before modelling. The classic pipeline: tokenise (split into words), drop stopwords (the, is, of), and lemmatise (running → run) so variants collapse to one form.

The classic NLP pipeline: text becomes tokens, gets cleaned, then turned into numbers.

Tokenise and tag with spaCy

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp('Apple is looking at buying a U.K. startup for $1 billion.')

for token in doc[:6]:
    print(f'{token.text:9s} lemma={token.lemma_:8s} pos={token.pos_:6s} stop={token.is_stop}')

▶ Output

Apple     lemma=Apple    pos=PROPN  stop=False
is        lemma=be       pos=AUX    stop=True
looking   lemma=look     pos=VERB   stop=False
at        lemma=at       pos=ADP    stop=True
buying    lemma=buy      pos=VERB   stop=False
a         lemma=a        pos=DET    stop=True

Named-entity recognition for free

for ent in doc.ents:
    print(f'{ent.text:12s} {ent.label_}')

▶ Output

Apple        ORG
U.K.         GPE
$1 billion   MONEY

How much to clean depends on the model. Classic models (TF-IDF) benefit from aggressive cleaning — lowercasing, stopword removal, lemmatising. Modern transformers prefer raw text with its punctuation and case intact, because they learned from it. Match the preprocessing to the model.
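To make the two cleaning styles concrete, here is a minimal standard-library sketch. The stopword set and the lemma lookup table are tiny hand-made stand-ins (assumptions for illustration); in practice you would rely on spaCy's own stopword flags and lemmatiser as shown above. The function names are hypothetical.

```python
import re

# Toy stand-ins (assumptions): real pipelines use spaCy's or NLTK's resources.
STOPWORDS = {"the", "is", "of", "a", "at", "on", "and", "for", "are"}
LEMMAS = {"running": "run", "looking": "look", "buying": "buy", "cats": "cat"}


def tokenise(text: str) -> list[str]:
    """Split text into word tokens, dropping punctuation."""
    return re.findall(r"[A-Za-z0-9']+", text)


def clean_for_tfidf(text: str) -> list[str]:
    """Aggressive cleaning: lowercase, drop stopwords, collapse word variants."""
    tokens = [t.lower() for t in tokenise(text)]
    return [LEMMAS.get(t, t) for t in tokens if t not in STOPWORDS]


def clean_for_transformer(text: str) -> str:
    """Light cleaning: only collapse whitespace; keep case and punctuation."""
    return " ".join(text.split())


sentence = "The  cats are running at the park!"
print(clean_for_tfidf(sentence))        # ['cat', 'run', 'park']
print(clean_for_transformer(sentence))  # The cats are running at the park!
```

The first function produces a short list of normalised tokens suited to counting; the second leaves the sentence almost untouched, which is closer to what a pretrained transformer's own tokeniser expects.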

Key points

Preprocessing: tokenise → remove stopwords → lemmatise to collapse word variants.
spaCy gives tokens, lemmas, part-of-speech tags and named entities out of the box.
Clean aggressively for TF-IDF; leave text mostly raw for transformers.

2. Representing text: Bag-of-Words & TF-IDF

Models need numbers, not words. Bag-of-Words counts how often each word appears; TF-IDF improves on it by down-weighting words common across all documents (so “the” counts for little) and up-weighting distinctive ones.

Each document becomes a row of TF-IDF weights — one number per vocabulary word.

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ['the cat sat on the mat',
          'the dog sat on the log',
          'cats and dogs are friends']

vec = TfidfVectorizer(stop_words='english')
X = vec.fit_transform(corpus)

print('Vocabulary:', list(vec.get_feature_names_out()))
print('Matrix shape (docs x words):', X.shape)

▶ Output

Vocabulary: ['cat', 'cats', 'dog', 'dogs', 'friends', 'log', 'mat', 'sat']
Matrix shape (docs x words): (3, 8)
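To see what the weighting does, it can help to work it by hand. The sketch below uses the plain textbook form, tf × log(N / df), on a small illustrative corpus (an assumption, chosen so that one word appears in every document). scikit-learn's TfidfVectorizer defaults differ in detail: it uses raw counts for tf, a smoothed idf of ln((1 + N) / (1 + df)) + 1, and L2-normalises each row, so its numbers will not match these exactly.

```python
import math
from collections import Counter

# Illustrative corpus (assumption): "the" appears in every document.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]


def tf_idf(docs):
    """Return one {word: weight} dict per document, using tf * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each word appear?
    df = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            word: (count / len(doc)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return weights


weights = tf_idf(docs)
for word, weight in sorted(weights[0].items(), key=lambda kv: -kv[1]):
    print(f"{word:5s} {weight:.3f}")
```

In the first document, "the" appears in all three documents and so gets a weight of zero, while "mat" appears only here and gets the largest weight.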

Bag-of-Words throws away word order. “Dog bites man” and “man bites dog” get identical vectors. It is fast and surprisingly strong for classification, but for anything needing meaning or order, you need embeddings and transformers (next).
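A quick way to convince yourself of this is to count the words in both sentences and compare the results. This is a minimal sketch using only the standard library; the helper name is hypothetical.

```python
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Count lowercase whitespace-separated tokens, ignoring their order."""
    return Counter(text.lower().split())


a = bag_of_words("Dog bites man")
b = bag_of_words("Man bites dog")
print(a)       # word counts for the first sentence
print(a == b)  # True: the two sentences are indistinguishable
```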

Key points

Bag-of-Words counts words; TF-IDF down-weights common words and up-weights distinctive ones.
Each document becomes a row in a document-term matrix of numeric weights.
These representations ignore word order — fast and effective for classification, but limited.

3. Text classification: sentiment analysis

Put it together: TF-IDF features plus a classifier is a strong, fast baseline for tasks like spam detection, topic tagging and sentiment analysis — and it is just a scikit-learn Pipeline.

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Assumes X_train / X_test are lists of review texts and
# y_train / y_test are their labels (1 = positive, 0 = negative)
model = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english', ngram_range=(1, 2))),
    ('clf',   LogisticRegression(max_iter=1000)),
])
model.fit(X_train, y_train)

print('Test accuracy:', round(model.score(X_test, y_test), 3))
print(model.predict([
    'This movie was absolutely wonderful and moving',
    'A total waste of time, boring and predictable',
]))

▶ Output

Test accuracy: 0.883
[1 0]

Which words swing the prediction?

import numpy as np
words  = model.named_steps['tfidf'].get_feature_names_out()
coefs  = model.named_steps['clf'].coef_[0]

top_pos = np.argsort(coefs)[-5:][::-1]
top_neg = np.argsort(coefs)[:5]
print('Most positive:', [words[i] for i in top_pos])
print('Most negative:', [words[i] for i in top_neg])

▶ Output

Most positive: ['excellent', 'wonderful', 'great', 'best', 'loved']
Most negative: ['worst', 'boring', 'waste', 'awful', 'terrible']
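Why do those coefficients matter? A logistic-regression classifier adds up weight × feature value for every word present, then squashes the total through a sigmoid to get a probability. The sketch below shows that arithmetic with made-up coefficients and TF-IDF values (all numbers are hypothetical, not taken from the fitted model above).

```python
import math

# Hypothetical coefficients for illustration; a fitted model learns these.
coefs = {"wonderful": 2.1, "great": 1.4, "boring": -1.9, "waste": -2.3}
bias = 0.0


def positive_probability(features: dict[str, float]) -> float:
    """Logistic regression: sigmoid of bias + sum(weight * feature value)."""
    score = bias + sum(coefs.get(word, 0.0) * x for word, x in features.items())
    return 1.0 / (1.0 + math.exp(-score))


p_pos = positive_probability({"wonderful": 0.7, "great": 0.5})
p_neg = positive_probability({"boring": 0.6, "waste": 0.6})
print(round(p_pos, 3), round(p_neg, 3))
```

Words with large positive coefficients push the probability toward 1, words with large negative coefficients push it toward 0, and words the model has never seen contribute nothing.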

Always build the simple baseline first. A TF-IDF + logistic-regression model trains in seconds, is fully interpretable (you can see which words drive it), and often gets you 85–90% of the way. Only reach for a transformer when that genuinely is not enough.

Key points

TF-IDF + a linear classifier in a Pipeline is a fast, strong text-classification baseline.
ngram_range=(1, 2) captures short phrases (bigrams) for extra signal.
Linear model coefficients reveal which words push toward each class — built-in interpretability.
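To picture what ngram_range=(1, 2) adds, here is a small stand-alone sketch that lists the unigrams and bigrams of a sentence. The helper is hypothetical and only mimics the idea; it is not scikit-learn's implementation.

```python
def ngrams(tokens, n_min=1, n_max=2):
    """Return all contiguous n-grams with n_min <= n <= n_max, joined by spaces."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


print(ngrams("not good at all".split()))
# ['not', 'good', 'at', 'all', 'not good', 'good at', 'at all']
```

Bigrams such as "not good" recover a little of the word order that single-word counts lose. One caveat worth checking on your own data: scikit-learn filters stop words before it forms n-grams, so with stop_words='english' a negation like "not" may be dropped before any bigram containing it is built.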

4. Word embeddings: meaning as vectors

TF-IDF treats “great” and “excellent” as unrelated. Word embeddings (word2vec, GloVe) fix this by mapping each word to a dense vector so that similar words sit close together — meaning becomes geometry.

The famous analogy: king − man + woman ≈ queen. Relationships become directions in vector space.

import gensim.downloader as api

wv = api.load('glove-wiki-gigaword-50')   # pretrained 50-d vectors (downloaded on first use)

print('Most similar to "python":')
for word, score in wv.most_similar('python', topn=3):
    print(f'  {word:12s} {score:.3f}')

# Vector arithmetic: king - man + woman = ?
result = wv.most_similar(positive=['king', 'woman'], negative=['man'], topn=1)
print('king - man + woman =', result[0][0])

▶ Output

Most similar to "python":
  perl         0.776
  php          0.742
  java         0.729
king - man + woman = queen

Embeddings are learned from context. Word2vec trains a model to predict which words appear near each other, and GloVe fits vectors to co-occurrence counts, so words used similarly end up nearby. The catch: classic embeddings give one fixed vector per word, so “bank” (river) and “bank” (money) share a vector. Transformers solve that next.
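The geometry is easier to trust once you have seen it on numbers small enough to check by hand. The sketch below uses hand-made 2-d vectors (an assumption: axis 0 stands loosely for "royalty", axis 1 for "gender") rather than real pretrained embeddings, and measures closeness with cosine similarity.

```python
import math

# Hand-made 2-d vectors (assumption): axis 0 ~ "royalty", axis 1 ~ "gender".
vectors = {
    "king":   (0.9,  0.9),
    "queen":  (0.9, -0.9),
    "man":    (0.1,  0.9),
    "woman":  (0.1, -0.9),
    "prince": (0.7,  0.8),
}


def cosine(u, v):
    """Cosine similarity: 1 = same direction, 0 = unrelated, -1 = opposite."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = tuple(x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c]))
    candidates = (w for w in vectors if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


print(analogy("king", "man", "woman"))  # queen
print(cosine(vectors["king"], vectors["prince"]) > cosine(vectors["king"], vectors["queen"]))  # True
```

Subtracting "man" and adding "woman" flips the second axis while leaving the first alone, which is exactly the sense in which a relationship becomes a direction. Real embeddings have tens or hundreds of dimensions and no axis has such a clean meaning.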

Key points

Embeddings map words to dense vectors where similar meanings are geometrically close.
Relationships become directions: king - man + woman ≈ queen.
Classic embeddings give one vector per word regardless of context — a limitation transformers fix.

5. Transformers & attention

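The transformer is the architecture behind modern NLP (BERT, GPT and beyond). Its key idea is attention: when processing a word, the model looks at the other words in the sequence and weighs how relevant each is, so meaning depends on context. Encoder models such as BERT attend in both directions; GPT-style decoders attend only to the words that came before.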

Attention lets “it” figure out it refers to “animal” — context-aware understanding.

Why transformers changed everything

Context-aware: the same word gets different vectors in different sentences (solving the “bank” problem).
Parallel: unlike older sequence models, they process a whole sentence at once — so they scale to huge data.
Pretrained & transferable: trained once on enormous text, then fine-tuned for your task with little data — exactly the transfer learning from Module 7.

You do not implement attention by hand. Understanding the idea — words weighing each other for context — is what matters. The next topic shows how to use these models in three lines via Hugging Face.
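Even so, a toy version can make the idea concrete. The sketch below computes scaled dot-product attention, softmax(QKᵀ / √d_k) V, with NumPy. It is deliberately simplified: the token vectors are random stand-ins, and the queries, keys and values are all the same matrix, whereas a real transformer derives each through its own learned projection and runs several attention heads in parallel.

```python
import numpy as np


def softmax(x, axis=-1):
    """Turn scores into positive weights that sum to 1 along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def scaled_dot_product_attention(Q, K, V):
    """Q, K: (seq_len, d_k); V: (seq_len, d_v). Returns (output, weights)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # relevance of every token to every other token
    weights = softmax(scores, axis=-1)  # one row of weights per token
    return weights @ V, weights         # each output is a weighted mix of the values


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))             # 4 tokens, 8-d vectors (random stand-ins)
out, w = scaled_dot_product_attention(x, x, x)
print(out.shape)      # (4, 8): one context-aware vector per token
print(w.sum(axis=1))  # each row of attention weights sums to 1
```

Each row of the weight matrix says how much one token draws on every token in the sentence, which is the "words weighing each other" idea in matrix form.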

Key points

Attention lets each word weigh every other word, making representations context-aware.
Transformers process whole sequences in parallel and scale to massive datasets.
They are pretrained on huge corpora then fine-tuned for specific tasks (transfer learning).

6. Practical NLP with Hugging Face

The Hugging Face ecosystem puts thousands of pretrained transformers one function call away. The pipeline API handles tokenising, running the model and decoding the output for you.

State-of-the-art sentiment in three lines

from transformers import pipeline

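# No model is named, so the library falls back to a default English
# sentiment model and logs a warning; pass model=... to pin one
classifier = pipeline('sentiment-analysis')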
result = classifier('I love how easy this library makes NLP!')
print(result)

▶ Output

[{'label': 'POSITIVE', 'score': 0.9998}]

Many tasks, same simple API

# Zero-shot: classify into labels the model was never trained on
zs = pipeline('zero-shot-classification')
print(zs('The GDP grew by 3% last quarter',
         candidate_labels=['economy', 'sports', 'technology'])['labels'][0])

# Summarisation, NER, translation, Q&A all use the same pattern
summariser = pipeline('summarization')
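# aggregation_strategy merges word pieces into whole entities
# (it replaces the deprecated grouped_entities=True)
ner = pipeline('ner', aggregation_strategy='simple')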

▶ Output

economy

Task	pipeline name
Sentiment / classification	`'sentiment-analysis'`
Named-entity recognition	`'ner'`
Summarisation	`'summarization'`
Question answering	`'question-answering'`
Translation	`'translation_en_to_fr'` (pattern: `'translation_XX_to_YY'`)
Text generation	`'text-generation'`

This is how modern NLP gets done. Start with a pretrained pipeline, evaluate it on your data, and only fine-tune if you must. The same skill transfers to large language models — the difference is scale, not concept. You now understand the whole stack from tokens to transformers.
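Evaluating a pipeline on your own data mostly means turning its label strings back into the 0/1 labels you already have. Here is one way that might look. The helper name is hypothetical, and the stand-in classifier below only imitates the output format shown earlier (a list of dicts with 'label' and 'score') so that the sketch runs without downloading a model.

```python
from typing import Callable, Sequence


def pipeline_accuracy(
    classifier: Callable[[list[str]], list[dict]],
    texts: Sequence[str],
    y_true: Sequence[int],
    positive_label: str = "POSITIVE",
) -> float:
    """Fraction of texts whose predicted label matches y_true (1 = positive)."""
    outputs = classifier(list(texts))
    y_pred = [1 if out["label"] == positive_label else 0 for out in outputs]
    correct = sum(p == t for p, t in zip(y_pred, y_true))
    return correct / len(y_true)


# Stand-in for a real pipeline (assumption): same output shape, trivial logic.
def fake_classifier(texts):
    return [
        {"label": "POSITIVE" if "love" in t else "NEGATIVE", "score": 1.0}
        for t in texts
    ]


acc = pipeline_accuracy(fake_classifier, ["I love it", "I hate it", "love, not"], [1, 0, 0])
print(round(acc, 3))  # 0.667
```

To use it for real, pass the object returned by pipeline('sentiment-analysis') as classifier. Label names vary between models, so check what yours returns and set positive_label accordingly.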

Key points

Hugging Face pipeline runs pretrained transformers with one call — tokenising and decoding included.
The same API covers sentiment, NER, summarisation, Q&A, translation and generation.
Start from a pretrained model and evaluate; fine-tune only when necessary.

★ Hands-on Project — Build a Review Classifier (Two Ways)

Classify real text sentiment with both the classic and modern approaches, and compare them honestly.

Load a labelled text dataset (e.g. IMDB reviews, tweets, or product reviews) and split into train/test.
Preprocess with spaCy: tokenise, remove stopwords and lemmatise; inspect a few examples before/after.
Baseline: build a TF-IDF + LogisticRegression Pipeline (with bigrams) and report test accuracy.
Interpretability: list the top words pushing toward positive and negative.
Modern approach: run a Hugging Face sentiment-analysis pipeline on the same test texts and measure its accuracy.
Compare the two on accuracy, speed and interpretability — and note when each is the right choice.
Try one more Hugging Face task on your text (NER or zero-shot classification) and show example outputs.
Write a short comparison and recommendation, then commit the notebook to your portfolio.

Ready to test yourself?

Take the module quiz. Score 70% or more to mark this module complete.
