Imtiaz Hossain


News Topic Classification

Twenty-seven experiments in one grid. Hover any cell to compare macro-F1 against training cost, from a 1.5s logistic regression to a 35-minute BERT.

best macro-F1: 0.9376 (BERT-Base, none)
best non-transformer: 0.9214 (Bi-GRU, 18.7s)
BERT train time: ~35 min (vs 19s for Bi-GRU)
F1 gap: 0.016 (BERT over Bi-GRU)
corpus: 102k training headlines, 4 classes
test set: 12,000, balanced across classes
experiments: 27 (9 models x 3 pipelines)
imbalance: 3.4x, handled by class weighting
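
The class weighting mentioned above can be sketched in a few lines. This is a minimal illustration, not the project's code: it assumes the common `n_samples / (n_classes * class_count)` normalization (the same one scikit-learn uses for `class_weight="balanced"`), assumes the 3.4x figure is the ratio of the largest class to the smallest, and uses made-up label counts and a hypothetical helper name.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Illustrative counts only (NOT the real corpus): largest class is 3.4x the smallest.
labels = (
    ["world"] * 340
    + ["sports"] * 200
    + ["business"] * 160
    + ["scitech"] * 100
)

weights = inverse_frequency_weights(labels)
for cls, weight in sorted(weights.items(), key=lambda item: item[1]):
    print(f"{cls:>8}: {weight:.3f}")
```

Under this scheme the rarest class gets the largest weight, and the ratio between the largest and smallest weights equals the imbalance ratio, so every class contributes equally to the total loss.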

interactive / live in your browser

[Interactive grid. Columns (pipelines): none, extreme, optimum. Rows (models): LogReg, DNN, RNN, GRU, LSTM, Bi-RNN, Bi-GRU, Bi-LSTM, BERT-Base.]

9 models x 3 preprocessing pipelines / ★ marks the best run / hover any cell

best run: BERT-Base (none / WordPiece)
macro-F1: 0.9376
training time: 35.2 min

[Charts: macro-F1 vs field; training cost (log scale)]

BERT buys 0.016 macro-F1 over Bi-GRU for roughly 100x the training time.
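
The trade-off can be checked directly from the figures quoted on this page. The snippet below is only arithmetic on those numbers; the variable names are illustrative.

```python
# Figures quoted on this page.
bert_f1, bigru_f1 = 0.9376, 0.9214
bert_seconds = 35.2 * 60   # 35.2 minutes of BERT training
bigru_seconds = 18.7       # Bi-GRU training time

f1_gap = bert_f1 - bigru_f1
time_ratio = bert_seconds / bigru_seconds

print(f"macro-F1 gap: {f1_gap:.4f}")       # 0.0162
print(f"time ratio:   {time_ratio:.0f}x")  # 113x
```

The exact ratio is about 113x, which the summary rounds to roughly 100x.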

the pipeline

From raw data to a verifiable result

  1. 01 / dataset

    Four topics, imbalanced

    102,002 training and 12,000 test headlines across Science & Technology, Business, Sports, and World News. The training set has a 3.4x class imbalance; the test set is balanced.

  2. 02 / preprocessing

    Three pipelines

    None (raw, HTML included), Extreme (stemming and full stopword removal), and Optimum (lemmatization, negation-preserving). Each is applied identically to every model.

    none: worst-case baseline / extreme: Porter stemming / optimum: from EDA
  3. 03 / representations

    TF-IDF to WordPiece

    TF-IDF for the classical models, from-scratch Skip-gram embeddings for six recurrent variants, and WordPiece for BERT-Base.

  4. 04 / training

    Nine architectures

    From logistic regression to bidirectional gated networks to a fine-tuned transformer, all under inverse-frequency class weighting on an 8GB RTX 3070.

  5. 05 / results

    The 27-run matrix

    Every model crossed with every pipeline. The story is in the contrasts, so hover cells in the grid to compare macro-F1 against training cost.

  6. 06 / insight

    BERT breaks the rule

    Preprocessing that helps shallow models hurts BERT: its WordPiece tokenizer handles raw HTML gracefully and is degraded by stemming. Preprocessing is model-specific, not universal.
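
To make the three pipelines from step 02 concrete, here is a rough standard-library sketch of how they might differ. It is not the project's implementation: the stopword list is a tiny illustrative one, `crude_stem` is a hypothetical stand-in for Porter stemming, lemmatization is left out of Optimum because it needs a dictionary such as WordNet, and it is an assumption that both Extreme and Optimum strip HTML.

```python
import html
import re

# Tiny illustrative lists; a real pipeline would use a full stopword list.
NEGATIONS = {"no", "not", "nor", "never"}
STOPWORDS = {"a", "an", "the", "of", "to", "in", "on", "is", "for", "and"} | NEGATIONS

TAG_RE = re.compile(r"<[^>]+>")
TOKEN_RE = re.compile(r"[a-z0-9']+")


def crude_stem(token):
    """Hypothetical stand-in for Porter stemming: strip one common suffix."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def tokenize(text):
    """Drop HTML tags, decode entities, lowercase, and split into word tokens."""
    cleaned = html.unescape(TAG_RE.sub(" ", text)).lower()
    return TOKEN_RE.findall(cleaned)


def preprocess(text, pipeline):
    if pipeline == "none":
        return text  # raw text, HTML included
    tokens = tokenize(text)
    if pipeline == "extreme":
        # Remove every stopword (negations too), then stem.
        return " ".join(crude_stem(t) for t in tokens if t not in STOPWORDS)
    if pipeline == "optimum":
        # Keep negations; lemmatization is omitted in this sketch.
        return " ".join(t for t in tokens if t not in STOPWORDS or t in NEGATIONS)
    raise ValueError(f"unknown pipeline: {pipeline}")


headline = "<b>Stocks</b> not rallying on the news"
for name in ("none", "extreme", "optimum"):
    print(f"{name:>8}: {preprocess(headline, name)}")
```

On this made-up headline, Extreme drops "not" and stems "news" down to "new", while Optimum keeps the negation and the original word forms. That kind of information loss is one plausible reason aggressive preprocessing can hurt a model that was pretrained on natural text.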

evaluation artifacts