Analytical Method
Guided Topic Modelling
Steering clustering toward specific analytical objectives by fine-tuning the embedding space with contrastive learning
Developed at CASM Technology · Senior Data Scientist
Motivation
The Limitation of Standard Embeddings
Standard topic modelling is fully unsupervised: clusters emerge from whatever structure exists in the embedding space, and the analyst discovers what the data contains after the fact. This is powerful for open-ended exploration but is a poor fit for projects where the analytical framework is defined in advance — where a client asks not "what is in this data?" but "how much of this data is about X, Y, and Z?"
The embedding model is the root of this limitation. Sentence transformers are trained on general-purpose tasks such as sentence similarity and paraphrase detection, so the resulting embedding space reflects broad semantic similarity. Two sentences that discuss different aspects of the same topic may cluster together because they share vocabulary, and two sentences that are analytically distinct may sit close together simply because they are stylistically similar. The model has no knowledge of the project's analytical distinctions.
Guided topic modelling addresses this by fine-tuning the sentence transformer on the specific distinctions that matter for the project, before clustering is run. The result is an embedding space shaped around the analytical framework, not just general semantic similarity — clusters that emerge from UMAP and HDBSCAN are then more likely to align with the categories the project is trying to find.
Approach
Contrastive Learning
Contrastive learning fine-tunes the sentence transformer by training it on pairs of sentences labelled as positive (should be close in the embedding space) or negative (should be far apart). For a topic modelling project, positive pairs are drawn from messages that represent the same analytically relevant category; negative pairs are drawn from messages that the model should distinguish — either because they are in different relevant categories, or because one is relevant and one is noise.
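The pairing rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual tooling: the example messages, the category names, the `noise` label, and the `build_pairs` helper are all hypothetical. It also assumes that two noise messages should not be treated as a positive pair, since they share no analytical category.

```python
from itertools import combinations

# Hypothetical labelled messages: (text, category).
# "noise" marks content that is irrelevant to the analytical framework.
labelled = [
    ("Fuel prices are rising again", "cost_of_living"),
    ("Rent went up twice this year", "cost_of_living"),
    ("The new vaccine is unsafe", "health_misinfo"),
    ("Doctors are hiding side effects", "health_misinfo"),
    ("Great match last night", "noise"),
]


def build_pairs(examples, noise_label="noise"):
    """Return (text_a, text_b, label) triples: 1 = positive, 0 = negative."""
    pairs = []
    for (text_a, cat_a), (text_b, cat_b) in combinations(examples, 2):
        if cat_a == cat_b:
            # Assumption: noise is not a coherent category, so two noise
            # messages are skipped rather than pulled together.
            if cat_a != noise_label:
                pairs.append((text_a, text_b, 1))
        else:
            # Different relevant categories, or relevant versus noise.
            pairs.append((text_a, text_b, 0))
    return pairs


pairs = build_pairs(labelled)
positives = [p for p in pairs if p[2] == 1]
negatives = [p for p in pairs if p[2] == 0]
print(len(positives), "positive pairs,", len(negatives), "negative pairs")
```

Exhaustive pairing grows quadratically with the number of labelled messages and yields far more negatives than positives, so a real run would probably sample or balance the pairs before training.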
This fine-tuning process adjusts the embedding space so that the dimensions that matter for the project's analytical distinctions become more prominent. Two messages discussing the same client-defined narrative will land closer together; messages from different narratives will be pushed further apart. UMAP and HDBSCAN then operate on this adjusted space, producing clusters that are shaped by the analytical framework rather than general semantic similarity alone.
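To make the "closer together" and "further apart" behaviour concrete, the sketch below computes one common form of contrastive loss on toy two-dimensional vectors. The text does not say which loss function was used, so the formulation, the cosine distance, and the `margin` value of 0.5 are assumptions chosen for illustration. The vectors stand in for sentence embeddings.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; 0 means same direction, 1 means orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def contrastive_loss(u, v, label, margin=0.5):
    """Pairwise contrastive loss (assumed formulation).

    Positive pairs (label 1) are penalised by their squared distance.
    Negative pairs (label 0) are penalised only while they sit closer
    than `margin`; beyond the margin they contribute nothing.
    """
    d = cosine_distance(u, v)
    if label == 1:
        return d ** 2
    return max(0.0, margin - d) ** 2


# Toy stand-ins for embeddings: two nearby vectors and one orthogonal one.
near_a, near_b = [1.0, 0.0], [0.9, 0.1]
far = [0.0, 1.0]

loss_pos_close = contrastive_loss(near_a, near_b, label=1)  # small
loss_neg_close = contrastive_loss(near_a, near_b, label=0)  # large
loss_neg_far = contrastive_loss(near_a, far, label=0)       # zero
loss_pos_far = contrastive_loss(near_a, far, label=1)       # large
```

Minimising this quantity over the labelled pairs is what moves same-category messages together and pushes different-category messages apart until they clear the margin.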
The key requirement is labelled examples — positive and negative pairs drawn from the actual project data. These can come from prior annotation rounds, from keyword-based pre-filtering, or from an initial unsupervised run used to generate candidate examples for a human to label. The approach is well-suited to active learning workflows: each annotation round produces examples that refine the embedding space, which produces better clusters, which make the next annotation round more efficient.
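As one illustration of the keyword-based route, the sketch below proposes candidate messages for each category so that a human can confirm or reject them before they become training pairs. The seed keywords, category names, and the `propose_candidates` helper are hypothetical. Substring matching is deliberately crude (for instance, "rent" would also match "current"), which is acceptable only because an annotator reviews every candidate.

```python
# Hypothetical seed keywords per category; a real project would define its own.
SEED_KEYWORDS = {
    "cost_of_living": {"rent", "prices", "bills"},
    "health_misinfo": {"vaccine", "side effects"},
}


def propose_candidates(messages, seed_keywords):
    """Map each category to the messages containing any of its seed keywords.

    The output is a list of candidates for annotation, not a set of labels.
    """
    candidates = {category: [] for category in seed_keywords}
    for message in messages:
        lowered = message.lower()
        for category, keywords in seed_keywords.items():
            if any(keyword in lowered for keyword in keywords):
                candidates[category].append(message)
    return candidates


messages = [
    "Rent is up again",
    "The vaccine has side effects they hide",
    "Lovely weather today",
]
candidates = propose_candidates(messages, SEED_KEYWORDS)
for category, found in candidates.items():
    print(category, "->", found)
```

Messages that match no keyword are simply not proposed. In a later annotation round, candidates could instead be drawn from the clusters produced by the previous fine-tuned model.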
Few-shot learning frameworks can substantially reduce the number of labelled examples required. Rather than training from scratch, a pre-trained sentence transformer is fine-tuned on a small number of contrastive pairs — in practice, a few dozen examples per category can be sufficient to meaningfully shift the embedding space toward the project's analytical distinctions.
When to Use
Applicability
Predefined analytical framework: When the client has specified in advance the narratives or categories they want to identify, and the standard unsupervised approach produces topics that do not align cleanly with those categories, guided topic modelling provides a way to steer the embedding space toward the predefined framework without requiring a full supervised classifier.
Active learning: When the goal is to iteratively build a classification capability from few labelled examples, guided topic modelling functions as an active learning loop — each annotation pass refines the model, which improves the candidate examples presented for the next pass. The output is not just a thematic breakdown but a progressively improving embedding model tailored to the domain.
When not to use: If the analytical goal is genuinely exploratory — discovering what narratives exist rather than confirming predefined ones — guided topic modelling will bias the clustering toward the analyst's prior assumptions. The unsupervised approach is more appropriate in that case. Guided modelling is also more expensive: it requires labelled examples, a fine-tuning run, and careful validation that the adjusted embedding space has not collapsed distinct narratives into a single cluster.
Tradeoffs
Advantages & Limitations
Advantages:
Produces clusters that align more closely with the project's analytical objectives than a purely unsupervised approach, without requiring a fully labelled dataset for a supervised classifier.
Works within the existing pipeline — the fine-tuned model replaces the standard sentence transformer at Stage 2, and the rest of the pipeline (UMAP, HDBSCAN, thematic allocation, evaluation) runs unchanged.
Few-shot capable: meaningful improvements in cluster alignment can be achieved with relatively small numbers of labelled examples, making it practical within project timelines.
Limitations:
Requires labelled examples: the quality and representativeness of the positive/negative pairs directly determine the quality of the fine-tuned embedding space. Poorly chosen pairs can bias the model in unexpected ways.
Risk of collapsing the embedding space: overly aggressive fine-tuning on a narrow set of distinctions can reduce the model's sensitivity to other dimensions of variation in the data, producing clusters that are analytically coherent on the target categories but blind to everything else.
Validation is more complex than for unsupervised topic modelling: the standard qualitative checks on topic homogeneity need to be supplemented with confirmation that the guided categories are being correctly separated and that the fine-tuned model has not simply memorised the training examples.
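One possible way to check separation and memorisation is to measure cluster purity on labelled examples that were held out of fine-tuning and compare it with purity on the training examples. The sketch below is an assumption about how such a check could look, not a description of the validation actually used. The `cluster_purity` helper and the toy labels are hypothetical, and the cluster IDs stand in for HDBSCAN output.

```python
from collections import Counter, defaultdict


def cluster_purity(cluster_ids, categories):
    """Fraction of items whose category is the majority category of their cluster.

    Every distinct cluster ID is treated as a cluster, including any noise
    label such as -1; filter those out first if that is not wanted.
    """
    by_cluster = defaultdict(list)
    for cluster_id, category in zip(cluster_ids, categories):
        by_cluster[cluster_id].append(category)
    majority_total = sum(
        Counter(members).most_common(1)[0][1] for members in by_cluster.values()
    )
    return majority_total / len(categories)


# Hypothetical held-out examples: true categories and assigned clusters.
held_out_categories = ["A", "A", "A", "B", "B", "B"]
held_out_clusters = [0, 0, 1, 1, 1, 1]

# Hypothetical training examples, perfectly separated by the tuned model.
train_categories = ["A", "A", "B", "B"]
train_clusters = [0, 0, 1, 1]

held_out_purity = cluster_purity(held_out_clusters, held_out_categories)
train_purity = cluster_purity(train_clusters, train_categories)
print(f"train purity {train_purity:.2f}, held-out purity {held_out_purity:.2f}")
```

A large gap between the two figures would suggest the model separates only the examples it was trained on. Purity rises trivially as the number of clusters grows, so it is best read alongside the cluster count and the usual qualitative checks.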
Applied in
Applied in projects where clients specified predefined analytical categories and standard unsupervised clustering did not produce sufficient alignment. Also used within active learning pipelines, where iterative annotation progressively refined the embedding model.
Related Methods
The same contrastive learning technique was later applied to a different problem: improving k-NN classification accuracy on population-scale annotated data, with no topic modelling involved. That experiment — Contrastive Fine-tuning for Classification — shares the underlying method but operates entirely downstream of the topic modelling pipeline, taking already-annotated data as input and using classification performance as the evaluation signal rather than cluster quality.