Analytical Method
Topic Allocation Workflow
Five-stage pipeline for large-scale narrative discovery — sentence transformers, UMAP, and HDBSCAN at its core
Developed at CASM Technology · Senior Data Scientist
Overview
Context & Challenge
Topic Allocation is a bottom-up approach to narrative discovery: it starts by dividing a corpus into semantically similar clusters, then studies and characterises each cluster, organising them into sub-themes and themes to produce a structured thematic breakdown. The result is a topic theme map — a document that assigns every message and topic to a position in a hierarchical analytical framework. While it has been applied primarily to social media and mainstream media corpora, the method is generic to any textual data.
The original motivation was to replace a slow, supervised classification workflow that was consistently blocking analysis at scale. The previous approach required identifying training examples on large datasets, training classifiers, iterating, and evaluating — a cycle that could take weeks per project with no guarantee of a useful output at the end. It was effectively waterfall: all the work up front, then a long wait to find out whether the direction had been right.
Topic Allocation was designed, and has since evolved, to make this process agile. Because it is unsupervised, it does not require labelled training data before analysis can begin. An analyst can have a workable thematic output within a day of data arriving, and a full analytical output produced collaboratively within a week. In evaluation, the precision achieved with this approach has been on par with, and in some cases better than, that of the supervised pipelines it replaced, at a fraction of the iteration cost.
A further advantage is language coverage. Because the method relies on off-the-shelf multilingual sentence transformers rather than custom-trained classifiers, a single pipeline can be applied across languages without needing a separately fine-tuned model for each language, domain, or client question. This removes one of the most significant bottlenecks in multilingual research programmes.
The core of the approach is the combination of sentence-transformer embeddings, UMAP dimensionality reduction, and HDBSCAN clustering. These three components can be assembled directly or via a library such as BERTopic, which bundles them together. The choice of tooling is secondary; what matters is the pipeline design and how each stage is configured. For a given configuration, the model's output is driven by the distribution and density of the data, so any change to the underlying dataset (adding new sources, expanding a time window, correcting a collection bug) typically requires rerunning the full pipeline. This makes data collection decisions upstream of topic modelling unusually consequential.
Architecture
Five-Stage Pipeline
Stage 1 — Preprocessing: Text is cleaned to ensure that topic keyword representations are meaningful. Operations include removing links, mention signs, hashtag signs, emojis, encoding artefacts, and stopwords, followed by deduplication. Without this step, links and handles appear as top keywords, masking the actual topic content. Deduplication prevents HDBSCAN from creating artificial dense clusters around repeated messages.
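The cleaning and deduplication steps above can be sketched with the standard library alone. This is a minimal illustration, not the production implementation: the stopword list, the emoji ranges, the choice to strip only the @ and # signs (keeping the word that follows), and case-insensitive deduplication are all assumptions made for the example.

```python
import re

# Illustrative stopword list (assumption); a real project would use a fuller,
# language-specific list.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "for", "on"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
SIGN_RE = re.compile(r"[@#]")  # drop the sign only, keep the word after it
# Rough emoji and pictograph ranges (assumption), not a full Unicode emoji definition.
EMOJI_RE = re.compile("[\U0001F000-\U0001FAFF\u2600-\u27BF\uFE0F\u200D]")
WS_RE = re.compile(r"\s+")


def clean_message(text: str) -> str:
    """Remove links, mention/hashtag signs, emojis, encoding debris, and stopwords."""
    text = URL_RE.sub(" ", text)
    text = SIGN_RE.sub("", text)
    text = EMOJI_RE.sub(" ", text)
    text = text.replace("\ufffd", " ")  # Unicode replacement character left by bad decoding
    tokens = [t for t in WS_RE.split(text.strip()) if t and t.lower() not in STOPWORDS]
    return " ".join(tokens)


def preprocess(messages):
    """Clean every message, then drop empty results and case-insensitive duplicates."""
    seen = set()
    cleaned = []
    for message in messages:
        text = clean_message(message)
        key = text.lower()
        if text and key not in seen:
            seen.add(key)
            cleaned.append(text)
    return cleaned
```

Deduplicating after cleaning, rather than before, means that two reposts differing only in their link or trailing emoji collapse into one message.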
Stage 2 — Semantic Representation: Each message is encoded into a contextualised embedding using a sentence-transformer model. The choice of model is a key decision: not all sentence transformers produce embeddings suited to the specific analytical need, and the model's max_seq_length sets how many tokens of each message are encoded, with anything beyond that limit truncated. For datasets with long documents, truncation can significantly affect cluster quality.
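A minimal sketch of the encoding step, with a check on how many messages would be truncated. The model name is only an example of an off-the-shelf multilingual model, not necessarily the one used in these projects, and the helper names are hypothetical. The sentence-transformers import is deferred so that the truncation check can be used on its own.

```python
from typing import Callable, Sequence


def load_encoder(model_name: str = "paraphrase-multilingual-MiniLM-L12-v2"):
    """Load a sentence-transformer model (requires the sentence-transformers package)."""
    # Imported lazily because it is a heavy dependency; the model name is an assumption.
    from sentence_transformers import SentenceTransformer
    return SentenceTransformer(model_name)


def truncated_share(
    messages: Sequence[str],
    count_tokens: Callable[[str], int],
    max_seq_length: int,
) -> float:
    """Fraction of messages whose token count exceeds the encoder's limit."""
    if not messages:
        return 0.0
    too_long = sum(1 for m in messages if count_tokens(m) > max_seq_length)
    return too_long / len(messages)


def embed(messages: Sequence[str], model, batch_size: int = 64):
    """Encode messages into one embedding vector per message."""
    return model.encode(list(messages), batch_size=batch_size, show_progress_bar=False)
```

With a loaded model, `max_seq_length` can be read from `model.max_seq_length`. The token counter is left as a parameter: assuming the model exposes a Hugging Face tokenizer as `model.tokenizer`, something like `lambda t: len(model.tokenizer(t)["input_ids"])` would serve, while a whitespace split gives only a rough lower bound.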
Stage 3 — Topic Modelling: The embeddings are reduced to a lower-dimensional space using UMAP, then HDBSCAN identifies dense areas — each area is a topic. HDBSCAN assigns a -1 label to messages that don't fall within any dense area. These fringe messages are not noise; they represent peripheral discussions that typically align with one core topic. Parameter tuning (UMAP dimensions, HDBSCAN minimum cluster size) is the most time-intensive part of the pipeline. Once topics are identified, c-TF-IDF is used to extract representative keywords for each.
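The sketch below shows one way the three parts of this stage might fit together: UMAP reduction, HDBSCAN clustering, and a plain-Python c-TF-IDF following the formulation used by BERTopic (term frequency within a topic, weighted by log(1 + average words per topic / total frequency of the term)). The parameter defaults are illustrative assumptions, not recommended values, and the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def reduce_and_cluster(embeddings, n_components=5, n_neighbors=15,
                       min_cluster_size=15, seed=42):
    """UMAP reduction followed by HDBSCAN; returns one label per message (-1 = fringe)."""
    # Imported lazily: requires the umap-learn and hdbscan packages.
    import umap
    import hdbscan
    reduced = umap.UMAP(
        n_neighbors=n_neighbors,
        n_components=n_components,
        min_dist=0.0,
        metric="cosine",
        random_state=seed,
    ).fit_transform(embeddings)
    clusterer = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size, metric="euclidean")
    return clusterer.fit_predict(reduced)


def ctfidf_keywords(docs, labels, top_n=5):
    """Class-based TF-IDF: each topic is treated as one concatenated document."""
    term_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        if label == -1:
            continue  # fringe messages do not contribute to topic keywords here
        term_counts[label].update(doc.split())
    if not term_counts:
        return {}
    total = Counter()
    for counts in term_counts.values():
        total.update(counts)
    avg_words = sum(total.values()) / len(term_counts)
    keywords = {}
    for label, counts in term_counts.items():
        n_words = sum(counts.values())
        scores = {
            term: (count / n_words) * math.log(1 + avg_words / total[term])
            for term, count in counts.items()
        }
        keywords[label] = sorted(scores, key=scores.get, reverse=True)[:top_n]
    return keywords
```

Whitespace tokenisation is enough here because the text has already been cleaned in Stage 1; a library such as BERTopic would normally handle this step with a vectoriser.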
Stage 4 — Thematic Allocation: An analyst samples messages from each topic and writes a description. Topics are then organised into sub-themes, which are grouped into broader themes. The resulting topic theme map becomes the reference document for analysis and evaluation. For quantitative projects, 10 messages per topic is the standard starting point; for qualitative exploration, 5 is typical. A more rigorous proportional sampling approach with entropy-based stopping criteria is documented separately — see Granular Annotation Scheme.
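Drawing the per-topic sample for annotation can be sketched as follows. This is the simple fixed-size version (10 or 5 messages per topic), not the proportional, entropy-based scheme referenced above; the function name and the fixed seed are assumptions for illustration.

```python
import random
from collections import defaultdict


def sample_per_topic(messages, labels, n_per_topic=10, seed=0):
    """Return up to n_per_topic randomly chosen messages for each topic label."""
    by_topic = defaultdict(list)
    for message, label in zip(messages, labels):
        by_topic[label].append(message)
    rng = random.Random(seed)  # fixed seed so the analyst sees the same sample on rerun
    return {
        topic: rng.sample(items, min(n_per_topic, len(items)))
        for topic, items in sorted(by_topic.items())
    }
```

Topics smaller than the requested sample size are returned in full, which keeps very small topics reviewable without special handling.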
Stage 5 — Evaluation: A representative sample of annotated messages is reviewed to assess precision and recall of the thematic allocation. Evaluation can be done in review mode (analyst can see the topic model annotations) or blind mode (annotations hidden). General sampling applies when themes are balanced; stratified sampling is used when specific sub-themes are underrepresented and analytically important.
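Per-theme precision and recall on a reviewed sample can be computed as in the sketch below. It assumes each reviewed message yields a pair of (theme assigned by the allocation, theme judged correct by the analyst); that data shape and the function name are assumptions, and the figures are estimates for the sample, not for the whole corpus.

```python
from collections import Counter


def precision_recall(pairs):
    """pairs: iterable of (allocated_theme, analyst_theme), one per reviewed message."""
    pairs = list(pairs)
    allocated = Counter(a for a, _ in pairs)
    actual = Counter(t for _, t in pairs)
    correct = Counter(a for a, t in pairs if a == t)
    themes = sorted(set(allocated) | set(actual))
    return {
        theme: {
            # None marks a theme with no allocated (or no actual) messages in the sample.
            "precision": correct[theme] / allocated[theme] if allocated[theme] else None,
            "recall": correct[theme] / actual[theme] if actual[theme] else None,
        }
        for theme in themes
    }
```

When the sample is stratified to boost a rare sub-theme, the per-theme figures remain usable, but any pooled figure would need reweighting by each stratum's true share of the data.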
Design Decisions
Key Considerations
Data collection timing: Topic models are highly sensitive to data distribution. Adding new accounts, expanding a date range, or correcting a collection bug after the model has been trained typically requires a full pipeline rerun. In practice, this means the data collection scope must be agreed and finalised before topic modelling begins — mid-project additions are expensive.
Sentence transformer selection: Models trained on sentence similarity tasks are used rather than general-purpose encoders. The choice depends on the domain, language, and the analytical goal — a model that produces good general-purpose clusters may not distinguish the specific narratives a project requires. Guided topic modelling, where a sentence transformer is fine-tuned using contrastive learning on project-labelled examples, is used when the default embedding space is insufficient.
Parameter tuning vs. reproducibility: UMAP and HDBSCAN are both sensitive to their hyperparameters, and the relationship between parameters and output topics is not always predictable. Tuning is an iterative process: adjust, rerun, sample, assess. UMAP is also stochastic, so its random seed needs to be fixed if a given parameter set is to reproduce the same topics. Embeddings are saved after Stage 2 so that parameter sweeps in Stage 3 do not require recomputing them, since encoding is the most computationally expensive step of the pipeline.
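A sketch of the cache-then-sweep pattern. The cache uses pickle to stay dependency-free (for NumPy arrays, `numpy.save` would be the more usual choice), and the clustering step is passed in as a function so the sweep logic is independent of any particular UMAP/HDBSCAN wrapper. All names here are hypothetical.

```python
import itertools
import pickle
from pathlib import Path


def load_or_compute(path, compute):
    """Return the cached object at path, computing and saving it on first use."""
    path = Path(path)
    if path.exists():
        with path.open("rb") as fh:
            return pickle.load(fh)
    result = compute()
    with path.open("wb") as fh:
        pickle.dump(result, fh)
    return result


def summarise(labels):
    """Number of topics and share of fringe (-1) messages for one clustering run."""
    labels = list(labels)
    topics = {label for label in labels if label != -1}
    fringe = sum(1 for label in labels if label == -1)
    return {
        "n_topics": len(topics),
        "fringe_share": fringe / len(labels) if labels else 0.0,
    }


def sweep(embeddings, cluster_fn, grid):
    """Run cluster_fn(embeddings, **params) for every combination in the grid."""
    names = sorted(grid)
    rows = []
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        rows.append({**params, **summarise(cluster_fn(embeddings, **params))})
    return rows
```

The summary rows only narrow the candidates: topic count and fringe share say nothing about coherence, so the sample-and-assess step still decides which configuration is kept.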
UMAP extrapolation limitation: When the topic model is trained on a sample of a large dataset and then applied to the full dataset, UMAP's projections can shift substantially — topics that were tightly defined on the sample can expand and drift when projected for the full corpus. This is a fundamental limitation of UMAP rather than a tuning problem, and it means that training-on-sample strategies require careful validation before deployment at scale.
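One simple validation check, offered as an assumption rather than the procedure used in these projects, is to compare each topic's share of the training sample with its share after the full corpus has been projected and labelled (for example with the fitted UMAP model's `transform` and `hdbscan.approximate_predict`). Large shifts flag topics that have expanded or drifted.

```python
from collections import Counter


def topic_shares(labels):
    """Share of messages per topic label."""
    labels = list(labels)
    counts = Counter(labels)
    return {topic: count / len(labels) for topic, count in counts.items()}


def share_drift(sample_labels, full_labels):
    """Per-topic change in share: full-corpus share minus training-sample share."""
    sample = topic_shares(sample_labels)
    full = topic_shares(full_labels)
    return {
        topic: full.get(topic, 0.0) - sample.get(topic, 0.0)
        for topic in sorted(set(sample) | set(full))
    }
```

A share comparison only catches drift that changes topic size; a topic can keep its share while its content changes, so sampled messages from the full-corpus assignment still need reading.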
Variants
Workflow Adaptations
Multilayered topic modelling: When an initial round of topic modelling produces heterogeneous topics — clusters that are semantically broad or that mix distinct narratives — a second layer of topic modelling is applied to those specific topics. This produces finer-grained sub-topics, which are then integrated into the annotation schema alongside the first-layer topics. Each layer maintains its own UMAP semantic map to keep visualisations coherent.
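The integration of second-layer sub-topics into the first-layer labels might look like the sketch below. The dotted label format ("3.1") and the decision to leave second-layer fringe messages at the parent topic are assumptions made for illustration; `cluster_fn` stands in for whatever UMAP and HDBSCAN run is applied to the items of one broad topic.

```python
def add_second_layer(items, labels, broad_topics, cluster_fn):
    """Return hierarchical labels such as '3' or '3.1', one per item.

    items: messages or embeddings, aligned with labels.
    broad_topics: first-layer topic ids judged too heterogeneous.
    cluster_fn: callable taking the items of one topic and returning sub-labels.
    """
    hierarchical = [str(label) for label in labels]
    for topic in broad_topics:
        idx = [i for i, label in enumerate(labels) if label == topic]
        sub_labels = cluster_fn([items[i] for i in idx])
        for i, sub in zip(idx, sub_labels):
            # Second-layer fringe items (-1) stay at the parent topic level.
            if sub != -1:
                hierarchical[i] = f"{topic}.{sub}"
    return hierarchical
```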
Guided topic modelling: When the analytical goal is to find topics aligned with specific predefined categories rather than whatever naturally emerges from the data, sentence transformers can be fine-tuned using contrastive learning. Positive and negative example pairs are used to adjust the embedding space so that it reflects the project's analytical distinctions. This moves topic modelling from purely unsupervised to a semi-supervised approach, which is useful when a client has a specific thematic framework they need to validate against the data.
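Building the positive and negative pairs from a small set of labelled examples can be sketched as follows. The input shape (text, category) and the exhaustive pairing are assumptions; with many examples, negatives would normally be subsampled.

```python
import itertools


def build_contrastive_pairs(examples):
    """examples: list of (text, category).

    Returns (text_a, text_b, label) triples, with label 1 for a same-category
    (positive) pair and 0 for a cross-category (negative) pair.
    """
    pairs = []
    for (text_a, cat_a), (text_b, cat_b) in itertools.combinations(examples, 2):
        pairs.append((text_a, text_b, 1 if cat_a == cat_b else 0))
    return pairs
```

Triples in this form match the convention expected by pairwise contrastive objectives such as `sentence_transformers.losses.ContrastiveLoss`, where 1 marks a similar pair and 0 a dissimilar one; the fine-tuning loop itself is not shown.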
Multilingual pipelines: For datasets spanning multiple languages, two strategies are used: embedding the source text directly using a multilingual sentence transformer (if a suitable model exists), or translating to English first and running a monolingual model. The translation approach introduces quality risk — translation errors propagate into topic definitions — and requires a structured evaluation of the translation model before use. See Machine Translation Evaluation for the full framework used in these decisions.
Outlier mitigation: The HDBSCAN -1 cluster can contain 40–60% of the data in some datasets. Two approaches — soft clustering (using the membership probability vector to assign fringe messages to nearby core topics) and k-means (applied to the -1 cluster using the HDBSCAN topic count as k) — have been experimentally evaluated. See Outlier Mitigation in HDBSCAN for the full investigation.
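The soft-clustering option can be illustrated with a short sketch. It assumes a membership matrix with one row per message and one column per topic, where column k corresponds to topic k; in the hdbscan library such a matrix can be obtained from `hdbscan.all_points_membership_vectors` on a clusterer fitted with `prediction_data=True`. The probability threshold is a hypothetical parameter, not a value taken from the investigation.

```python
def reassign_fringe(labels, membership, min_probability=0.0):
    """Move fringe messages (-1) to their most probable topic.

    labels: one topic label per message, -1 for fringe.
    membership: per-message sequence of membership probabilities, one per topic.
    min_probability: fringe messages below this stay at -1.
    """
    reassigned = []
    for label, probs in zip(labels, membership):
        if label == -1 and len(probs) > 0:
            best = max(range(len(probs)), key=lambda k: probs[k])
            if probs[best] >= min_probability:
                label = best
        reassigned.append(label)
    return reassigned
```

Messages already assigned to a topic are left untouched, so the core topics defined by HDBSCAN keep their original membership.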
Applied in
Used across multiple social media analysis and mainstream media monitoring projects — including advocacy tracking, disinformation analysis, and economic discourse analysis — spanning English and multilingual corpora in Arabic, Turkish, and Farsi.