
News Topic Classification

27 controlled experiments across preprocessing, embeddings, and models, revealing that BERT inverts the preprocessing rules that hold for everything else.

period: 2026
status: research


best macro-f1: 0.9376; BERT-Base, no preprocessing
best non-transformer: 0.9214; Bi-GRU in 18.7s
experiments: 27; 9 models x 3 pipelines
corpus: 102k; training headlines

fig. 00: system architecture and data flow

The question

Transformers dominate NLP headlines, but on short text like news titles, where each input is only a handful of tokens and pretrained context modelling has less to work with, do classical bag-of-words pipelines and recurrent networks still compete? I built a single reproducible harness to find out, on a four-class corpus (Science & Technology, Business, Sports, World News) of 102,002 training and 12,000 test headlines.

The experimental grid

The design is deliberately factorial: nine models times three preprocessing pipelines = 27 controlled experiments, all under identical class-weighting and tuning protocols so I could isolate the marginal contribution of each decision.
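A minimal sketch of how such a factorial grid can be enumerated. The model names follow the list given further down; the identifiers and the build_grid helper are illustrative assumptions, not the harness's actual code.

```python
from itertools import product

# Identifiers are hypothetical spellings of the nine models and three pipelines.
MODELS = [
    "logistic_regression", "deep_nn",
    "simple_rnn", "gru", "lstm",
    "bi_simple_rnn", "bi_gru", "bi_lstm",
    "bert_base",
]
PIPELINES = ["none", "extreme", "optimum"]


def build_grid(models, pipelines):
    """Return every (model, pipeline) combination exactly once."""
    return [{"model": m, "pipeline": p} for m, p in product(models, pipelines)]


GRID = build_grid(MODELS, PIPELINES)
print(len(GRID))  # 9 models x 3 pipelines
```

Enumerating the full cross product up front makes it harder to skip a cell by accident, which is what lets each decision's marginal effect be read off the results.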

The three preprocessing pipelines:

None: raw text, HTML wrappers included. A worst-case baseline.
Extreme: lowercase, HTML stripped, URL/digit/punctuation removed, NLTK stopwords removed, Porter stemming.
Optimum: lowercase, HTML stripped, URL collapsing, light stopword removal that preserves negations, WordNet lemmatization, digits kept (finance and sports headlines need numbers).
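The three pipelines can be sketched roughly as follows, using only the standard library. Several details are assumptions made for illustration: the tiny stopword set stands in for the NLTK list, URLs are collapsed to a placeholder token "url", punctuation is also stripped in optimum, and the stemming or lemmatization step is passed in as a normalize callable (identity by default) where the real pipelines use Porter stemming and WordNet lemmatization.

```python
import re

# Hypothetical stand-in stopword list; the write-up uses NLTK's.
STOPWORDS = {"the", "a", "an", "of", "to", "in", "on", "for", "and", "is", "not", "no"}
NEGATIONS = {"not", "no", "never"}

TAG_RE = re.compile(r"<[^>]+>")
URL_RE = re.compile(r"https?://\S+|www\.\S+")


def identity(token):
    return token


def preprocess(text, pipeline, normalize=identity):
    """Apply one of the three pipelines; `normalize` is the stem/lemma step."""
    if pipeline == "none":
        return text  # raw text, HTML wrappers included
    text = TAG_RE.sub(" ", text).lower()
    if pipeline == "extreme":
        text = URL_RE.sub(" ", text)
        text = re.sub(r"[^a-z\s]", " ", text)  # drops digits and punctuation
        stop = STOPWORDS
    elif pipeline == "optimum":
        text = URL_RE.sub(" url ", text)  # assumed placeholder token
        text = re.sub(r"[^a-z0-9\s]", " ", text)  # keeps digits
        stop = STOPWORDS - NEGATIONS  # light removal: negations survive
    else:
        raise ValueError(f"unknown pipeline: {pipeline}")
    return " ".join(normalize(t) for t in text.split() if t not in stop)


sample = "<b>Stocks</b> did NOT fall 3% on the news, see http://example.com/x"
for name in ("none", "extreme", "optimum"):
    print(name, "->", preprocess(sample, name))
```

On this sample, extreme loses both the negation and the number, while optimum keeps them, which is the distinction the pipeline descriptions draw.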

The nine models: Logistic Regression and a 4-layer Deep NN over TF-IDF; SimpleRNN, GRU, LSTM and their bidirectional variants over from-scratch Skip-gram embeddings; and BERT-Base. The training set has a 3.4x class imbalance, so every model uses inverse-frequency class weighting, which is necessary rather than optional here.
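As a sketch of inverse-frequency class weighting: the formula below, n_samples / (n_classes * count), is the common "balanced" heuristic and is an assumption here, since the exact formula used is not stated. The label counts are invented toy numbers chosen to give a 3.4x imbalance; they are not the corpus distribution.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * k) for cls, k in counts.items()}


# Toy labels only (hypothetical counts, 34:10 between largest and smallest class).
toy_labels = (
    ["Business"] * 34
    + ["World News"] * 20
    + ["Science & Technology"] * 16
    + ["Sports"] * 10
)
weights = inverse_frequency_weights(toy_labels)
for cls, w in sorted(weights.items()):
    print(f"{cls}: {w:.3f}")
```

The rarest class receives the largest weight, so errors on it cost proportionally more during training. In PyTorch, weights like these would typically be passed as the weight tensor of torch.nn.CrossEntropyLoss, ordered by class index.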

Four patterns emerged

1. Preprocessing matters most for the TF-IDF models. The Deep NN gains 3.4 macro-F1 points moving from none to optimum, while Bi-LSTM gains only 1.1, because the frequency floor on its Skip-gram vocabulary already filters HTML noise before it reaches the network.

2. Optimum beats Extreme for most recurrent models. Aggressive Porter stemming collapses semantically distinct tokens (international / intern), and removing all stopwords drops the negation cues that World News and Business reporting depend on.

3. Bidirectional gated cells dominate the non-transformer family. Bi-GRU on optimum hits 0.9214 macro-F1 in 18.7 seconds of training, beating Logistic Regression and the unidirectional GRU, and landing about 0.016 below BERT-Base at roughly one hundredth of the training time (18.7 s versus ~35 min).

4. BERT inverts the preprocessing intuition. This is the surprise. BERT scores highest on none (0.9376) and lowest on extreme (0.9288). WordPiece tokenization treats HTML wrappers as predictable subword sequences and benefits from preserved capitalization, while Porter stemming destroys the subword statistics BERT was pretrained on. The preprocessing rule that helps every other architecture actively hurts BERT.

Model	Best variant	Macro-F1	Train time
BERT-Base	none	0.9376	~35 min
Bi-GRU	optimum	0.9214	18.7 s
Logistic Regression	extreme	0.9209	1.5 s
Deep NN (worst)	none	0.8804	52 s
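The trade-off in the table can be checked with a few lines of arithmetic. The figures are copied from the rows above; BERT's training time is the approximate ~35 min, so the ratio is a rough estimate, and the compare helper is a hypothetical name.

```python
# Values copied from the table; BERT's time is approximate (~35 min).
RESULTS = {
    "BERT-Base": {"variant": "none", "macro_f1": 0.9376, "train_seconds": 35 * 60},
    "Bi-GRU": {"variant": "optimum", "macro_f1": 0.9214, "train_seconds": 18.7},
    "Logistic Regression": {"variant": "extreme", "macro_f1": 0.9209, "train_seconds": 1.5},
    "Deep NN": {"variant": "none", "macro_f1": 0.8804, "train_seconds": 52.0},
}


def compare(better, cheaper, results=RESULTS):
    """Return (macro-F1 gap, training-time ratio) between two models."""
    a, b = results[better], results[cheaper]
    gap = a["macro_f1"] - b["macro_f1"]
    ratio = a["train_seconds"] / b["train_seconds"]
    return gap, ratio


gap, ratio = compare("BERT-Base", "Bi-GRU")
print(f"macro-F1 gap: {gap:.4f}, training-time ratio: {ratio:.0f}x")
```

The gap works out to about 0.016 macro-F1 for a training-time ratio a little over 100x, consistent with the "two orders of magnitude" framing.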

The takeaway

The dominant source of residual error is the Business / World News pair, which shares macro-economic and political vocabulary; Sports is essentially separable (F1 ~0.97 across the family). The headline conclusion for practitioners: treat preprocessing as model-specific, not universal. What cleans up a bag-of-words model can degrade a subword transformer. The playground renders the full 27-run matrix as an interactive heatmap so you can trace every one of these effects yourself.
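For readers who want to reproduce the per-class view, here is a minimal sketch of macro-F1 computed from a confusion matrix (rows are true classes, columns are predictions). The matrix is an invented toy example that only mimics the qualitative pattern described above, with Business and World News confused most often; it is not data from the experiments.

```python
LABELS = ["Science & Technology", "Business", "Sports", "World News"]

# Hypothetical toy counts, rows = true class, columns = predicted class.
TOY_CONFUSION = [
    [90, 5, 1, 4],
    [6, 85, 0, 9],
    [1, 1, 97, 1],
    [3, 10, 1, 86],
]


def per_class_f1(confusion, labels):
    """F1 per class: 2*TP / (2*TP + FP + FN)."""
    scores = {}
    for i, label in enumerate(labels):
        tp = confusion[i][i]
        fn = sum(confusion[i]) - tp
        fp = sum(row[i] for row in confusion) - tp
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(confusion, labels):
    """Unweighted mean of per-class F1, so every class counts equally."""
    scores = per_class_f1(confusion, labels)
    return sum(scores.values()) / len(scores)


toy_scores = per_class_f1(TOY_CONFUSION, LABELS)
for label, score in toy_scores.items():
    print(f"{label}: {score:.3f}")
print(f"macro-F1: {macro_f1(TOY_CONFUSION, LABELS):.3f}")
```

Because macro-F1 averages classes without weighting by frequency, a weak pair such as Business / World News pulls the score down even when a class like Sports is nearly perfect.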

stack

PyTorch, HuggingFace Transformers, BERT-Base, Bi-GRU, LSTM, gensim Word2Vec, scikit-learn, TF-IDF
