Most of the world's data is text — reviews, emails, tickets, posts, contracts. Natural Language Processing (NLP) is how we turn that messy language into something a model can use. This module traces the field's whole arc in one sitting: from classic preprocessing and TF-IDF, through word embeddings, to the transformers that power today's language models — and crucially, how to use pretrained models in a few lines with Hugging Face. You will finish able to build a working text classifier and to apply state-of-the-art models without a research budget.
1. Text preprocessing & linguistic features
Raw text must be broken into pieces and normalised before modelling. The classic pipeline: tokenise (split into words), drop stopwords (the, is, of), and lemmatise (running → run) so variants collapse to one form.
Tokenise and tag with spaCy
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp('Apple is looking at buying a U.K. startup for $1 billion.')
for token in doc[:6]:
    print(f'{token.text:9s} lemma={token.lemma_:8s} pos={token.pos_:6s} stop={token.is_stop}')
Output:
Apple     lemma=Apple    pos=PROPN  stop=False
is        lemma=be       pos=AUX    stop=True
looking   lemma=look     pos=VERB   stop=False
at        lemma=at       pos=ADP    stop=True
buying    lemma=buy      pos=VERB   stop=False
a         lemma=a        pos=DET    stop=True
Named-entity recognition for free
for ent in doc.ents:
    print(f'{ent.text:12s} {ent.label_}')
Output:
Apple        ORG
U.K.         GPE
$1 billion   MONEY
- Preprocessing: tokenise → remove stopwords → lemmatise to collapse word variants.
- spaCy gives tokens, lemmas, part-of-speech tags and named entities out of the box.
- Clean aggressively for TF-IDF; leave text mostly raw for transformers.
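To make the three preprocessing steps concrete without any library, here is a minimal sketch in plain Python. The stopword set, the lemma lookup table and the function name `preprocess` are all hand-made assumptions for illustration; a real pipeline would rely on spaCy's lemmatiser and stopword list instead.

```python
import re

# Tiny hand-made resources, for illustration only. A real pipeline would use
# a library's stopword list and a proper lemmatiser.
STOPWORDS = {"the", "is", "of", "a", "at", "on", "and", "were", "in"}
LEMMAS = {"running": "run", "ran": "run", "cats": "cat", "mice": "mouse"}


def preprocess(text: str) -> list[str]:
    """Tokenise, drop stopwords, then map each token to its lemma."""
    tokens = re.findall(r"[a-z]+", text.lower())         # 1. tokenise
    content = [t for t in tokens if t not in STOPWORDS]  # 2. drop stopwords
    return [LEMMAS.get(t, t) for t in content]           # 3. lemmatise


print(preprocess("The cats were running in the garden"))
print(preprocess("A cat ran in the garden"))
```

Both sentences reduce to the same three tokens, which is exactly the "variants collapse to one form" effect the pipeline is after.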
2. Representing text: Bag-of-Words & TF-IDF
Models need numbers, not words. Bag-of-Words counts how often each word appears; TF-IDF improves on it by down-weighting words common across all documents (so “the” counts for little) and up-weighting distinctive ones.
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ['the cat sat on the mat',
          'the dog sat on the log',
          'cats and dogs are friends']
vec = TfidfVectorizer(stop_words='english')
X = vec.fit_transform(corpus)
print('Vocabulary:', list(vec.get_feature_names_out()))
print('Matrix shape (docs x words):', X.shape)
Output:
Vocabulary: ['cat', 'cats', 'dog', 'dogs', 'friends', 'log', 'mat', 'sat']
Matrix shape (docs x words): (3, 8)
- Bag-of-Words counts words; TF-IDF down-weights common words and up-weights distinctive ones.
- Each document becomes a row in a document-term matrix of numeric weights.
- These representations ignore word order — fast and effective for classification, but limited.
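To see where the weights come from, the sketch below computes TF-IDF by hand for the first document of the toy corpus, with stopwords already removed. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation, which is scikit-learn's default behaviour; the helper name `tfidf` is hypothetical.

```python
import math
from collections import Counter

docs = [
    ["cat", "sat", "mat"],        # stopwords already removed
    ["dog", "sat", "log"],
    ["cats", "dogs", "friends"],
]

n_docs = len(docs)
# Document frequency: in how many documents does each word appear?
df = Counter(word for doc in docs for word in set(doc))


def tfidf(doc: list[str]) -> dict[str, float]:
    """Return L2-normalised TF-IDF weights for one tokenised document."""
    tf = Counter(doc)
    # Term count times smoothed inverse document frequency.
    raw = {w: tf[w] * (math.log((1 + n_docs) / (1 + df[w])) + 1) for w in tf}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {w: v / norm for w, v in raw.items()}


weights = tfidf(docs[0])
for word, weight in sorted(weights.items()):
    print(f"{word:5s} {weight:.3f}")
```

Because "sat" appears in two of the three documents, it ends up with a lower weight (about 0.47) than "cat" and "mat" (about 0.62 each), which appear in only one.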
3. Text classification: sentiment analysis
Put it together: TF-IDF features plus a classifier is a strong, fast baseline for tasks like spam detection, topic tagging and sentiment analysis — and it is just a scikit-learn Pipeline.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
# X_train / X_test: lists of review texts; y_train / y_test: labels (1 = positive, 0 = negative)
model = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english', ngram_range=(1, 2))),
    ('clf', LogisticRegression(max_iter=1000)),
])
model.fit(X_train, y_train)
print('Test accuracy:', round(model.score(X_test, y_test), 3))
print(model.predict([
    'This movie was absolutely wonderful and moving',
    'A total waste of time, boring and predictable',
]))
Output:
Test accuracy: 0.883
[1 0]
Which words swing the prediction?
import numpy as np
words = model.named_steps['tfidf'].get_feature_names_out()
coefs = model.named_steps['clf'].coef_[0]
top_pos = np.argsort(coefs)[-5:][::-1]
top_neg = np.argsort(coefs)[:5]
print('Most positive:', [words[i] for i in top_pos])
print('Most negative:', [words[i] for i in top_neg])
Output:
Most positive: ['excellent', 'wonderful', 'great', 'best', 'loved']
Most negative: ['worst', 'boring', 'waste', 'awful', 'terrible']
- TF-IDF + a linear classifier in a Pipeline is a fast, strong text-classification baseline.
- `ngram_range=(1, 2)` keeps single words and adds two-word phrases (bigrams) for extra signal.
- Linear model coefficients reveal which words push toward each class, which gives built-in interpretability.
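To make the effect of `ngram_range=(1, 2)` visible, the following sketch lists the features produced from one tokenised phrase. The function `ngrams` is a hypothetical helper written for this illustration; it mimics the idea, not scikit-learn's internal implementation.

```python
def ngrams(tokens: list[str], n_min: int = 1, n_max: int = 2) -> list[str]:
    """Return all contiguous n-grams with n_min <= n <= n_max, shortest first."""
    features = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            features.append(" ".join(tokens[i:i + n]))
    return features


print(ngrams(["not", "good", "at", "all"]))
# ['not', 'good', 'at', 'all', 'not good', 'good at', 'at all']
```

The bigram "not good" carries a negation that the unigrams "not" and "good" lose when counted separately, which is where the extra signal comes from.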
4. Word embeddings: meaning as vectors
TF-IDF treats “great” and “excellent” as unrelated. Word embeddings (word2vec, GloVe) fix this by mapping each word to a dense vector so that similar words sit close together — meaning becomes geometry.
import gensim.downloader as api
wv = api.load('glove-wiki-gigaword-50') # pretrained 50-d vectors
print('Most similar to "python":')
for word, score in wv.most_similar('python', topn=3):
    print(f' {word:12s} {score:.3f}')
# Vector arithmetic: king - man + woman = ?
result = wv.most_similar(positive=['king', 'woman'], negative=['man'], topn=1)
print('king - man + woman =', result[0][0])
Output:
Most similar to "python":
 perl         0.776
 php          0.742
 java         0.729
king - man + woman = queen
- Embeddings map words to dense vectors where similar meanings are geometrically close.
- Relationships become directions: `king - man + woman ≈ queen`.
- Classic embeddings give one vector per word regardless of context, a limitation transformers fix.
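The "meaning becomes geometry" idea can be sketched with cosine similarity on a handful of made-up vectors. Everything here is an assumption for illustration: the 3-d vectors are invented (real embeddings are learned and have 50 or more dimensions), and `cosine` and `analogy` are hypothetical helpers, not gensim functions.

```python
import math

# Made-up 3-d vectors, for illustration only.
# The axes loosely stand for (royalty, male, female).
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}


def cosine(u, v):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c):
    """Solve a - b + c = ? by nearest cosine neighbour, excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


print(analogy("king", "man", "woman"))   # queen
```

Subtracting "man" and adding "woman" moves the vector along the gender direction while keeping the royalty component, so the closest remaining word is "queen".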
5. Transformers & attention
The transformer is the architecture behind modern NLP (BERT, GPT and beyond). Its key idea is attention: when processing a word, the model looks at every other word and weighs how relevant each is — so meaning depends on context.
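A minimal sketch of the attention idea, in plain Python. It is heavily simplified: the 2-d embeddings are made up, and queries, keys and values are all set to the raw embeddings, whereas a real transformer learns separate projections for each and uses many attention heads. The function names are hypothetical.

```python
import math

# Made-up 2-d embeddings, for illustration only.
# Axis 0 loosely means "finance", axis 1 loosely means "nature".
emb = {
    "bank":  [0.5, 0.5],   # ambiguous on its own
    "money": [1.0, 0.0],
    "river": [0.0, 1.0],
}


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(words):
    """Scaled dot-product self-attention with Q = K = V = the raw embeddings."""
    X = [emb[w] for w in words]
    d = len(X[0])
    out = []
    for q in X:
        # Relevance of every word in the sentence to the current word.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        weights = softmax(scores)
        # New representation: attention-weighted average of all word vectors.
        out.append([sum(w * v[i] for w, v in zip(weights, X)) for i in range(d)])
    return out


bank_near_money = self_attention(["money", "bank"])[1]
bank_near_river = self_attention(["river", "bank"])[1]
print(bank_near_money)
print(bank_near_river)
```

The same word "bank" comes out leaning toward the finance axis next to "money" and toward the nature axis next to "river", which is the context-dependence that a single fixed embedding cannot express.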
Why transformers changed everything
- Context-aware: the same word gets different vectors in different sentences (solving the “bank” problem).
- Parallel: unlike older sequence models, they process a whole sentence at once — so they scale to huge data.
- Pretrained & transferable: trained once on enormous text, then fine-tuned for your task with little data — exactly the transfer learning from Module 7.
- Attention lets each word weigh every other word, making representations context-aware.
- Transformers process whole sequences in parallel and scale to massive datasets.
- They are pretrained on huge corpora then fine-tuned for specific tasks (transfer learning).
6. Practical NLP with Hugging Face
The Hugging Face ecosystem puts thousands of pretrained transformers one function call away. The pipeline API handles tokenising, running the model and decoding the output for you.
State-of-the-art sentiment in three lines
from transformers import pipeline
classifier = pipeline('sentiment-analysis')
result = classifier('I love how easy this library makes NLP!')
print(result)
Output:
[{'label': 'POSITIVE', 'score': 0.9998}]
Many tasks, same simple API
# Zero-shot: classify into labels the model was never trained on
zs = pipeline('zero-shot-classification')
print(zs('The GDP grew by 3% last quarter',
         candidate_labels=['economy', 'sports', 'technology'])['labels'][0])
# Summarisation, NER, translation, Q&A all use the same pattern
summariser = pipeline('summarization')
# aggregation_strategy replaces the deprecated grouped_entities flag
ner = pipeline('ner', aggregation_strategy='simple')
Output:
economy
| Task | pipeline name |
|---|---|
| Sentiment / classification | 'sentiment-analysis' |
| Named-entity recognition | 'ner' |
| Summarisation | 'summarization' |
| Question answering | 'question-answering' |
| Translation | 'translation' |
| Text generation | 'text-generation' |
- Hugging Face `pipeline` runs pretrained transformers with one call, tokenising and decoding included.
- The same API covers sentiment, NER, summarisation, Q&A, translation and generation.
- Start from a pretrained model and evaluate; fine-tune only when necessary.
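Evaluating a pretrained sentiment pipeline mostly means converting its label/score dictionaries into the 0/1 labels used elsewhere in this module. The sketch below does that on stand-in data so it runs without downloading a model: `sample_outputs` mimics the list-of-dicts format shown above but its values are made up, `y_true` is hypothetical, and the 0.7 confidence threshold is an arbitrary choice.

```python
# Stand-in for what classifier(texts) returns: one dict per input text.
# These values are made up for illustration, not real model outputs.
sample_outputs = [
    {"label": "POSITIVE", "score": 0.9998},
    {"label": "NEGATIVE", "score": 0.9931},
    {"label": "POSITIVE", "score": 0.5512},
    {"label": "NEGATIVE", "score": 0.9720},
]
y_true = [1, 0, 0, 0]   # hypothetical gold labels: 1 = positive, 0 = negative


def to_binary(outputs):
    """Map POSITIVE/NEGATIVE label strings to 1/0."""
    return [1 if o["label"] == "POSITIVE" else 0 for o in outputs]


y_pred = to_binary(sample_outputs)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print("Accuracy:", accuracy)

# Low-confidence predictions are good candidates for manual review.
uncertain = [i for i, o in enumerate(sample_outputs) if o["score"] < 0.7]
print("Uncertain indices:", uncertain)
```

If accuracy on your own labelled sample is already good enough, there is no need to fine-tune; the low-confidence cases show where the pretrained model struggles on your data.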
★ Hands-on Project — Build a Review Classifier (Two Ways)
Classify real text sentiment with both the classic and modern approaches, and compare them honestly.
- Load a labelled text dataset (e.g. IMDB reviews, tweets, or product reviews) and split into train/test.
- Preprocess with spaCy: tokenise, remove stopwords and lemmatise; inspect a few examples before/after.
- Baseline: build a TF-IDF + LogisticRegression Pipeline (with bigrams) and report test accuracy.
- Interpretability: list the top words pushing toward positive and negative.
- Modern approach: run a Hugging Face `sentiment-analysis` pipeline on the same test texts and measure its accuracy.
- Compare the two on accuracy, speed and interpretability, and note when each is the right choice.
- Try one more Hugging Face task on your text (NER or zero-shot classification) and show example outputs.
- Write a short comparison and recommendation, then commit the notebook to your portfolio.
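For the comparison step, one possible harness times a batch prediction and reports accuracy for any function that maps texts to 0/1 labels. The helper `evaluate` and the two stand-in predictors are hypothetical, included only so the sketch runs on its own; in the project you would pass `model.predict` from the TF-IDF pipeline and a wrapper around the Hugging Face classifier instead.

```python
import time


def evaluate(name, predict, texts, labels):
    """Time one batch prediction and return (name, accuracy, seconds)."""
    start = time.perf_counter()
    preds = predict(texts)
    elapsed = time.perf_counter() - start
    acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
    return name, acc, elapsed


# Hypothetical stand-in predictors, not real models.
def keyword_model(texts):
    return [1 if "good" in t.lower() else 0 for t in texts]


def always_positive(texts):
    return [1 for _ in texts]


texts = ["Good film", "Boring plot", "Really good acting", "Awful"]
labels = [1, 0, 1, 0]

results = [
    evaluate("keyword", keyword_model, texts, labels),
    evaluate("always-positive", always_positive, texts, labels),
]
for name, acc, secs in results:
    print(f"{name:16s} accuracy={acc:.2f} time={secs * 1000:.3f} ms")
```

Running both approaches through the same function on the same test set keeps the comparison fair; interpretability still has to be judged separately, for example from the linear model's coefficients.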