Analytical Method
Classification Approaches
Three strategies for assigning categories to documents at scale — from no labelled data to fully supervised — and a framework for choosing between them
Developed at CASM Technology · Senior Data Scientist
Context
Classification After Topic Modelling
Topic modelling produces a thematic breakdown of a corpus — a set of discovered topics that describe what the data contains. In many projects, the next step is to go further: assign each document in the full corpus to a category from the analytical framework, producing counts, trends, and breakdowns that can be reported to a client or fed into downstream analysis.
This classification step sits at a practical junction. The analyst has just run a computationally expensive topic model on a sample or subset of the data; rerunning the full pipeline on the entire corpus is often not feasible. The question becomes: given the thematic structure discovered so far, how do we classify the remaining documents quickly and reliably?
The right approach depends on how much labelled data is available, how precisely the categories are defined, and what tradeoff between speed and accuracy the project tolerates. Three strategies cover the range of situations that arise in practice.
Approach 1 — No Labels Required
Zero-shot Classification (NLI)
Zero-shot classification uses a Natural Language Inference (NLI) model, typically a BART-based model fine-tuned on entailment tasks, to score how strongly each document entails a natural-language description of each category. No labelled examples are needed: each category is expressed as a hypothesis ("this text is about X"), the document serves as the premise, and the model returns the probability that the premise entails the hypothesis.
This makes it well-suited for rapid prototyping, for classifying against categories defined mid-project before annotation has begun, or for cases where labelling is not feasible. The main constraint is hypothesis quality: the performance of zero-shot classification is directly determined by how well the hypothesis text captures what the category actually means. Vague or overly broad hypotheses produce noisy results.
In practice, zero-shot is most useful as a first pass — to estimate how well the categories are separable before committing to annotation, or as a weak signal combined with other methods in an ensemble. For projects where categories are well-defined and interpretable, it can perform surprisingly well with no annotation overhead at all.
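The scoring step can be sketched as follows. This is a minimal illustration, not the project's implementation: `nli_logits` is a hypothetical callable standing in for a real NLI model, the (contradiction, neutral, entailment) label order is an assumption that varies between checkpoints, and `toy_nli_logits` is a word-overlap stub used only so the example runs without downloading a model. Dropping the neutral logit and taking a softmax over contradiction versus entailment is one common convention for scoring each category independently.

```python
import math
from typing import Callable, Dict, Tuple

# (contradiction, neutral, entailment) logits; this order is an assumption.
NliLogits = Tuple[float, float, float]


def entailment_probability(logits: NliLogits) -> float:
    """Softmax over contradiction vs entailment, ignoring the neutral logit."""
    contradiction, _neutral, entailment = logits
    top = max(contradiction, entailment)  # subtract the max for numerical stability
    exp_c = math.exp(contradiction - top)
    exp_e = math.exp(entailment - top)
    return exp_e / (exp_c + exp_e)


def zero_shot_scores(
    document: str,
    categories: Dict[str, str],
    nli_logits: Callable[[str, str], NliLogits],
) -> Dict[str, float]:
    """Score one document (premise) against every category hypothesis."""
    return {
        name: entailment_probability(nli_logits(document, hypothesis))
        for name, hypothesis in categories.items()
    }


def toy_nli_logits(premise: str, hypothesis: str) -> NliLogits:
    """Stand-in for a real NLI model: rewards shared content words only."""
    template_words = {"this", "text", "is", "about"}
    shared = set(premise.lower().split()) & set(hypothesis.lower().split())
    overlap = len(shared - template_words)
    return (1.0 - overlap, 0.0, -1.0 + 2.0 * overlap)


# Hypothetical categories, each written as a hypothesis sentence.
categories = {
    "vaccines": "this text is about vaccines",
    "elections": "this text is about elections",
}
scores = zero_shot_scores("new vaccines were approved today", categories, toy_nli_logits)
best = max(scores, key=scores.get)
```

Because each hypothesis is scored independently, a document can score highly on several categories or on none, which is what makes the scores usable as a weak signal in an ensemble.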
Approach 2 — Few Labels Required
Exemplar-based k-NN
k-NN classification uses a small set of hand-annotated exemplar documents per category. Each exemplar is embedded using a sentence transformer and stored as a reference point. New documents are classified by finding their k nearest exemplars by cosine similarity and assigning the most common label among them — with no retraining of the embedding model.
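The lookup described above can be sketched in a few lines. This is a minimal sketch under stated assumptions: the two-dimensional vectors and the labels `health` and `politics` are made up for illustration, whereas real exemplars would be sentence-transformer embeddings held in a vector store.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Exemplar = Tuple[Vector, str]  # (embedding, category label)


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_exemplars(query: Vector, exemplars: List[Exemplar], k: int) -> List[Tuple[str, float]]:
    """Return the k most similar exemplars as (label, similarity), best first."""
    scored = [(label, cosine_similarity(query, vec)) for vec, label in exemplars]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def knn_classify(query: Vector, exemplars: List[Exemplar], k: int = 3) -> str:
    """Majority vote among the k nearest exemplars.

    Ties go to the label encountered first, i.e. the label of the closest neighbour.
    """
    neighbours = nearest_exemplars(query, exemplars, k)
    votes = Counter(label for label, _ in neighbours)
    return votes.most_common(1)[0][0]


# Toy exemplar set (hypothetical values).
exemplars = [
    ([1.0, 0.1], "health"),
    ([0.9, 0.2], "health"),
    ([0.8, 0.0], "health"),
    ([0.1, 1.0], "politics"),
    ([0.2, 0.9], "politics"),
]
predicted = knn_classify([0.95, 0.15], exemplars, k=3)
```

Adding a new category or correcting a label only means adding or editing exemplars; nothing is retrained.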
This is practical when a few dozen well-chosen exemplars per category are available, which is typically the case after an annotation round on the topic model output. It handles large, multilingual corpora efficiently because each document is embedded once and then classified by a vector lookup, rather than by a separate model inference call for every document-category pair as in zero-shot NLI. A ChromaDB-backed store makes the exemplar index persistent and queryable at scale.
Three voting strategies are available: majority vote (the most common label among the k neighbours), similarity-weighted sum (each label scored by the sum of its neighbours' similarity scores), and average similarity (each label scored by the mean similarity of its neighbours). The choice of strategy and the minimum similarity threshold are calibrated from the distance distribution of the exemplars themselves, which grounds the thresholds in the actual geometric structure of each category rather than in an arbitrary fixed value.
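The three voting strategies differ only in how neighbour similarities are aggregated per label, as the sketch below shows. The function names are hypothetical, and `calibrate_threshold` is one illustrative rule (a low quantile of same-category exemplar similarities); the source does not specify the exact calibration procedure.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

Neighbour = Tuple[str, float]  # (label, cosine similarity to the query)


def label_scores(neighbours: List[Neighbour], strategy: str) -> Dict[str, float]:
    """Aggregate neighbour similarities into one score per label."""
    sims = defaultdict(list)
    for label, sim in neighbours:
        sims[label].append(sim)
    if strategy == "majority":
        return {label: float(len(values)) for label, values in sims.items()}
    if strategy == "weighted_sum":
        return {label: sum(values) for label, values in sims.items()}
    if strategy == "average":
        return {label: sum(values) / len(values) for label, values in sims.items()}
    raise ValueError(f"unknown strategy: {strategy}")


def vote(
    neighbours: List[Neighbour],
    strategy: str = "majority",
    min_similarity: float = 0.0,
) -> Optional[str]:
    """Return the winning label, or None if no neighbour clears the threshold."""
    kept = [(label, sim) for label, sim in neighbours if sim >= min_similarity]
    if not kept:
        return None
    scores = label_scores(kept, strategy)
    return max(scores, key=scores.get)


def calibrate_threshold(same_label_sims: List[float], quantile: float = 0.1) -> float:
    """Illustrative rule: take a low quantile (floor-indexed) of the
    similarities observed between exemplars of the same category."""
    ordered = sorted(same_label_sims)
    index = int(quantile * (len(ordered) - 1))
    return ordered[index]


# Two moderately similar "A" exemplars versus one very similar "B" exemplar.
neighbours = [("A", 0.62), ("A", 0.60), ("B", 0.95)]
by_majority = vote(neighbours, "majority")
by_weighted = vote(neighbours, "weighted_sum")
by_average = vote(neighbours, "average")
```

In this toy case majority vote and weighted sum both pick `A`, while average similarity picks `B`, because a single very close neighbour outweighs two mediocre ones. Returning `None` below the threshold lets out-of-scope documents remain unclassified rather than being forced into the nearest category.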
The quality of k-NN classification is directly tied to the quality of the embedding space. When categories that should be distinct are semantically entangled in the baseline embedding space, contrastive fine-tuning can resolve that entanglement and produce a measurable improvement in classification accuracy.
Approach 3 — More Labels Available
Keywords & ML Classifiers
Keyword-based classification uses explicit vocabulary lists or regex patterns to assign categories. It requires no model training and produces fully interpretable decisions: a document is assigned a category because it contains specific terms defined by the analyst. It is fast, auditable, and robust to domain shift — but its recall is limited to whatever vocabulary has been enumerated, making it brittle against paraphrase and variation in expression.
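A minimal sketch of a regex-based classifier makes both the auditability and the brittleness concrete. The lexicon below is hypothetical and exists only for illustration.

```python
import re
from typing import Dict, List, Pattern

# Hypothetical analyst-defined lexicon: category -> regex patterns.
LEXICON: Dict[str, List[str]] = {
    "health": [r"\bvaccin\w*", r"\bhospitals?\b"],
    "politics": [r"\belections?\b", r"\bparliament\b"],
}


def compile_lexicon(lexicon: Dict[str, List[str]]) -> Dict[str, List[Pattern[str]]]:
    """Compile every pattern once, case-insensitively."""
    return {
        category: [re.compile(pattern, re.IGNORECASE) for pattern in patterns]
        for category, patterns in lexicon.items()
    }


def keyword_classify(text: str, compiled: Dict[str, List[Pattern[str]]]) -> Dict[str, List[str]]:
    """Return each matched category with the exact terms that triggered it."""
    hits: Dict[str, List[str]] = {}
    for category, patterns in compiled.items():
        matched = [m.group(0) for p in patterns for m in p.finditer(text)]
        if matched:
            hits[category] = matched
    return hits


compiled = compile_lexicon(LEXICON)
matched = keyword_classify("Vaccination rates rose before the election", compiled)
missed = keyword_classify("People queued for their jabs", compiled)
```

The first call returns the triggering terms alongside each category, which is what makes every decision auditable. The second returns an empty result even though the text is about vaccination, because "jabs" was never enumerated.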
Traditional ML classifiers (logistic regression or an SVM on TF-IDF or embedding features) require a labelled training set but generalise beyond explicit vocabulary. They are fast to train, well understood, and produce scores (probabilities, in the case of logistic regression) that can be thresholded. They are a reasonable choice when a few hundred labelled examples are available and a lightweight, deployable model is needed.
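To show what such a classifier learns, here is a deliberately small, pure-Python sketch of binary logistic regression on TF-IDF features. It is written without dependencies so that it is self-contained; in practice a library such as scikit-learn would normally be used. The training sentences, labels, learning rate, and epoch count are all made-up assumptions.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    return text.lower().split()


def fit_idf(docs: List[str]) -> Dict[str, float]:
    """Smoothed inverse document frequency for every token seen in training."""
    n = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in set(tokenize(doc)))
    return {tok: math.log((1 + n) / (1 + df)) + 1.0 for tok, df in doc_freq.items()}


def tfidf(text: str, idf: Dict[str, float]) -> Dict[str, float]:
    """L2-normalised sparse TF-IDF vector; unseen tokens are dropped."""
    counts = Counter(tok for tok in tokenize(text) if tok in idf)
    vec = {tok: count * idf[tok] for tok, count in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {tok: v / norm for tok, v in vec.items()}


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_logreg(
    features: List[Dict[str, float]], labels: List[int], epochs: int = 200, lr: float = 0.5
) -> Tuple[Dict[str, float], float]:
    """Plain stochastic gradient descent on the logistic loss."""
    weights: Dict[str, float] = {}
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            p = sigmoid(bias + sum(weights.get(t, 0.0) * v for t, v in x.items()))
            error = p - y  # gradient of the loss with respect to the logit
            bias -= lr * error
            for t, v in x.items():
                weights[t] = weights.get(t, 0.0) - lr * error * v
    return weights, bias


def predict_proba(text: str, idf: Dict[str, float], weights: Dict[str, float], bias: float) -> float:
    x = tfidf(text, idf)
    return sigmoid(bias + sum(weights.get(t, 0.0) * v for t, v in x.items()))


# Toy labelled set: 1 = health, 0 = politics (hypothetical examples).
docs = [
    "the hospital opened a new vaccine clinic",
    "doctors urged vaccine uptake",
    "parliament voted on the election bill",
    "the election campaign started today",
]
labels = [1, 1, 0, 0]
idf = fit_idf(docs)
weights, bias = train_logreg([tfidf(d, idf) for d in docs], labels)
p_health = predict_proba("vaccine clinic reopened", idf, weights, bias)
p_politics = predict_proba("election results announced", idf, weights, bias)
```

Unlike a keyword list, the model weights every term it has seen, so a new document is scored on partial evidence rather than on an exact rule match, and the resulting probability can be thresholded to trade precision against recall.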
Fine-tuned transformer classifiers reach the highest accuracy when sufficient labelled data is available. A pre-trained encoder (BERT, RoBERTa, or a domain-specific variant) is fine-tuned end-to-end on the labelled set, learning both the category boundaries and the relevant features jointly. Transfer learning from the pre-trained model reduces the data requirement substantially compared to training from scratch — in practice, a few hundred examples per class is often sufficient for good performance on well-defined categories.
In multilingual settings, multilingual transformer models (e.g. mBERT, XLM-RoBERTa) extend this approach across languages without separate models per language.
Choosing an Approach
The three approaches cover different positions on the labelled data and accuracy spectrum. In practice, projects rarely stay with a single method: zero-shot is used to validate category separability early, k-NN takes over once an annotation round has produced exemplars, and a fine-tuned classifier is introduced if accuracy requirements are high enough to justify the annotation cost.
| Approach | Labels needed | Accuracy ceiling | Best when |
|---|---|---|---|
| Zero-shot NLI | None | Moderate | Categories are well-described in natural language; annotation is not yet feasible; rapid prototyping |
| Exemplar k-NN | ~20–100 per class | Good–high | An annotation round has been completed; large corpus; no retraining budget; embedding space is clean |
| Keywords | None (lexicon) | Low–moderate | Auditability is required; categories are narrow and vocabulary-stable; fast iteration on rules |
| Traditional ML | Hundreds per class | Good | Labelled data exists; lightweight deployable model required; interpretability of features valued |
| Fine-tuned Transformer | Hundreds–thousands | Highest | Sufficient labelled data; accuracy is the priority; multilingual coverage needed |
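The table can be read as a rough decision rule. The sketch below encodes one possible reading of it; the function name, its parameters, and the numeric cut-offs (20 and 200 labels per class) are assumptions chosen to match the approximate ranges in the table, not fixed thresholds.

```python
def suggest_approach(
    labels_per_class: int,
    accuracy_critical: bool = False,
    audit_required: bool = False,
) -> str:
    """Map a project's situation to a starting approach (illustrative only)."""
    if audit_required:
        # Every decision must be traceable to an explicit analyst-defined term.
        return "keywords"
    if labels_per_class < 20:
        # Too few exemplars for k-NN: describe the categories in words instead.
        return "zero-shot NLI"
    if labels_per_class < 200:
        # Assumed cut-off between "dozens" and "hundreds" of labels per class.
        return "exemplar k-NN"
    if accuracy_critical:
        return "fine-tuned transformer"
    return "traditional ML"


early_stage = suggest_approach(0)
after_annotation = suggest_approach(50)
```

A rule like this is only a starting point, since projects typically move through several of these approaches as annotation accumulates.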