News Topic Classification
27 controlled experiments across preprocessing, embeddings, and models, revealing that BERT inverts the preprocessing rules that hold for everything else.
- Best macro-F1: 0.9376 (BERT-Base, no preprocessing)
- Best non-transformer: 0.9214 (Bi-GRU, trained in 18.7 s)
- Experiments: 27 (9 models × 3 pipelines)
- Corpus: 102k training headlines

The question
Transformers dominate NLP headlines, but on short text like news titles, where each input carries only a handful of tokens and pretraining signal is diluted, do classical bag-of-words pipelines and recurrent networks still compete? I built a single reproducible harness to find out, on a four-class corpus (Science & Technology, Business, Sports, World News) of 102,002 training and 12,000 test headlines.
The experimental grid
The design is deliberately factorial: nine models × three preprocessing pipelines = 27 controlled experiments, all run under identical class-weighting and tuning protocols so that the marginal contribution of each decision can be isolated.
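As a rough illustration of the factorial design, the grid can be enumerated with a Cartesian product. The model and pipeline names come from this write-up; the identifiers and the `build_grid` helper are illustrative assumptions, not the harness's actual code.

```python
from itertools import product

# Names follow the write-up; the identifier spellings are illustrative.
MODELS = [
    "logistic_regression", "deep_nn",          # TF-IDF models
    "simple_rnn", "gru", "lstm",               # unidirectional recurrent
    "bi_simple_rnn", "bi_gru", "bi_lstm",      # bidirectional recurrent
    "bert_base",                               # transformer
]
PIPELINES = ["none", "extreme", "optimum"]


def build_grid(models=MODELS, pipelines=PIPELINES):
    """Return every (model, pipeline) pair exactly once."""
    return list(product(models, pipelines))


GRID = build_grid()
print(len(GRID))  # 9 models x 3 pipelines
```

Enumerating the runs up front makes it easy to confirm that no cell of the grid is skipped or duplicated before any training starts.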
The three preprocessing pipelines:
- None: raw text, HTML wrappers included. A worst-case baseline.
- Extreme: lowercase, HTML stripped, URL/digit/punctuation removed, NLTK stopwords removed, Porter stemming.
- Optimum: lowercase, HTML stripped, URL collapsing, light stopword removal that preserves negations, WordNet lemmatization, digits kept (finance and sports headlines need numbers).
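The three pipelines might look roughly like the sketch below. It is a simplified stand-in, not the original code: the stopword and negation sets are tiny hypothetical lists rather than the NLTK list, the `url` placeholder token is an assumption about what "URL collapsing" means, and the `stem` and `lemmatize` hooks default to the identity function so the example runs without any downloads.

```python
import re

HTML_TAG = re.compile(r"<[^>]+>")
URL = re.compile(r"https?://\S+|www\.\S+")
NON_LETTER = re.compile(r"[^a-z\s]")  # removes digits and punctuation
PUNCT = re.compile(r"[^\w\s]")        # removes punctuation, keeps digits

# Hypothetical miniature lists; the write-up uses the NLTK stopword list.
STOPWORDS = {"the", "a", "an", "of", "to", "in", "on", "for", "and", "is", "not", "no"}
NEGATIONS = {"not", "no", "never"}


def _identity(token):
    return token


def pipeline_none(text):
    """Raw text, HTML wrappers included."""
    return text


def pipeline_extreme(text, stem=_identity):
    """Lowercase, strip HTML, drop URLs/digits/punctuation and all stopwords, then stem."""
    text = HTML_TAG.sub(" ", text.lower())
    text = URL.sub(" ", text)
    text = NON_LETTER.sub(" ", text)
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return " ".join(stem(t) for t in tokens)


def pipeline_optimum(text, lemmatize=_identity):
    """Lowercase, strip HTML, collapse URLs, keep digits and negations, then lemmatize."""
    text = HTML_TAG.sub(" ", text.lower())
    text = URL.sub(" url ", text)  # assumed placeholder for a collapsed URL
    text = PUNCT.sub(" ", text)
    light_stopwords = STOPWORDS - NEGATIONS
    tokens = [t for t in text.split() if t not in light_stopwords]
    return " ".join(lemmatize(t) for t in tokens)


raw = "<b>Stocks NOT rising</b> in 2024, see http://example.com/a"
print(pipeline_extreme(raw))
print(pipeline_optimum(raw))
```

To get closer to the pipelines described above, one could pass `nltk.stem.PorterStemmer().stem` as `stem` and `nltk.stem.WordNetLemmatizer().lemmatize` as `lemmatize`, which requires the corresponding NLTK data to be installed.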
The nine models: Logistic Regression and a 4-layer Deep NN over TF-IDF; SimpleRNN, GRU, LSTM, and their bidirectional variants over Skip-gram embeddings trained from scratch; and BERT-Base. The training set has a 3.4x class imbalance, so every model uses inverse-frequency class weighting; here it is a necessity, not an option.
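One common way to compute inverse-frequency weights is sketched below. The normalization `N / (K * n_c)` is an assumption (it matches the widely used "balanced" heuristic); the write-up does not state which variant was used, and the toy label counts are hypothetical, chosen only to reproduce a 3.4x imbalance.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by N / (K * n_c): rarer classes get larger weights."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


# Hypothetical toy counts with a 3.4x gap between largest and smallest class.
toy_labels = (
    ["world"] * 34 + ["sports"] * 20 + ["business"] * 16 + ["sci_tech"] * 10
)
weights = inverse_frequency_weights(toy_labels)
for cls, w in sorted(weights.items()):
    print(f"{cls}: {w:.3f}")
```

With this normalization every class contributes the same total weight to the loss, so the ratio between the largest and smallest weight equals the imbalance ratio itself.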
Four patterns emerged
1. Preprocessing matters most for shallow models. The Deep NN gains 3.4 macro-F1 points moving from none to optimum, while Bi-LSTM gains only 1.1, because its Skip-gram embedding's frequency floor already filters HTML noise before it reaches the network.
2. Optimum beats Extreme for most recurrent models. Aggressive Porter stemming collapses semantically distinct tokens (international / intern) and removing all stopwords drops the negation cues that World News and Business reporting depend on.
3. Bidirectional gated cells dominate the non-transformer family. Bi-GRU on optimum hits 0.9214 macro-F1 in 18.7 seconds of training, beating Logistic Regression and the unidirectional GRU, and landing within 0.016 of BERT for two orders of magnitude less compute.
4. BERT inverts the preprocessing intuition. This is the surprise. BERT scores highest on none (0.9376) and lowest on extreme (0.9288). WordPiece tokenization treats HTML wrappers as predictable subword sequences and benefits from preserved capitalization, while Porter stemming destroys the subword statistics BERT was pretrained on. The preprocessing rule that helps every other architecture actively hurts BERT.
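The frequency floor mentioned in pattern 1 can be illustrated with a minimal vocabulary filter. This is a sketch under assumptions: the threshold value, the `<unk>` replacement token, and the toy headlines are all hypothetical, and the write-up does not say whether rare tokens were dropped or mapped to an unknown token.

```python
from collections import Counter


def build_vocab(tokenized_docs, min_count=2):
    """Keep only tokens that occur at least min_count times in the corpus."""
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    return {tok for tok, c in counts.items() if c >= min_count}


def apply_floor(doc, vocab, unk="<unk>"):
    """Replace out-of-vocabulary tokens with a shared unknown token."""
    return [tok if tok in vocab else unk for tok in doc]


# Hypothetical toy corpus: one headline carries a stray HTML fragment.
docs = [
    ["stocks", "rise"],
    ["stocks", "fall"],
    ["<b>stocks", "rise"],
]
vocab = build_vocab(docs, min_count=2)
print(apply_floor(docs[2], vocab))
```

Because markup-fused tokens such as `<b>stocks` tend to be rare, a floor like this removes much of the HTML noise before the recurrent network sees it, which is consistent with the small gain Bi-LSTM gets from explicit cleaning.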
| Model | Pipeline (best unless noted) | Macro-F1 | Train time |
|---|---|---|---|
| BERT-Base | none | 0.9376 | ~35 min |
| Bi-GRU | optimum | 0.9214 | 18.7 s |
| Logistic Regression | extreme | 0.9209 | 1.5 s |
| Deep NN (worst) | none | 0.8804 | 52 s |
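The headline comparisons can be checked directly from the table. The snippet below uses only the figures reported here, treating BERT's "~35 min" as exactly 35 minutes, which is an approximation.

```python
import math

bert_f1, bigru_f1, logreg_f1 = 0.9376, 0.9214, 0.9209
bert_seconds = 35 * 60   # "~35 min", treated as exact for this estimate
bigru_seconds = 18.7

gap_to_bert = bert_f1 - bigru_f1        # macro-F1 gap between BERT and Bi-GRU
gap_to_logreg = bigru_f1 - logreg_f1    # Bi-GRU margin over Logistic Regression
speedup = bert_seconds / bigru_seconds  # training-time ratio

print(f"gap to BERT: {gap_to_bert:.4f}")
print(f"margin over LogReg: {gap_to_logreg:.4f}")
print(f"speedup: {speedup:.0f}x (~10^{math.log10(speedup):.1f})")
```

The ratio comes out at a little over 100x, which is where the "two orders of magnitude" figure comes from; note that it compares training time only.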
The takeaway
The dominant source of residual error is the Business / World News pair, which shares macro-economic and political vocabulary; Sports is essentially separable (F1 ~0.97 across the family). The headline conclusion for practitioners: treat preprocessing as model-specific, not universal. What cleans up a bag-of-words model can degrade a subword transformer. The playground renders the full 27-run matrix as an interactive heatmap so you can trace every one of these effects yourself.
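Since every score above is a macro-F1, a small reference implementation may help when reading the per-class results. The labels and predictions below are hypothetical toy data, arranged so that Business and World News are confused with each other while Sports is separated perfectly.

```python
def per_class_f1(y_true, y_pred):
    """Return F1 for each class, computed from true/false positives and negatives."""
    scores = {}
    for cls in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[cls] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so each class counts equally."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Hypothetical toy predictions, not results from the experiments.
y_true = ["business", "business", "world", "world", "sports", "sports"]
y_pred = ["business", "world", "world", "business", "sports", "sports"]

scores = per_class_f1(y_true, y_pred)
print(scores)
print(round(macro_f1(y_true, y_pred), 4))
```

Because the mean is unweighted, a confusable pair like Business / World News pulls the macro score down even when a class like Sports is nearly perfect.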