Topic Modelling Recipes

What This Is

Documentation as a Deliverable

Running topic modelling in production across multiple projects, languages, and client contexts surfaces a consistent set of problems — many of them non-obvious the first time you encounter them. This repo is the written record of those problems and the solutions that held up across projects.

It is not a library. It is a structured body of documentation: methodological guides, decision frameworks, investigation write-ups, and a stage-by-stage checklist. Each document is self-contained and cross-referenced, so a new project can start from the relevant section without re-reading the rest.

The documents began as internal working notes and were revised and extended as new projects introduced new problems. The repo makes them available as a reference for anyone doing similar work.

Scope

What It Covers

The documentation spans the full topic modelling workflow — from data collection decisions through to analysis — and covers the integration of machine translation for multilingual projects. It is organised into three tiers:

General documents — the overall workflow and a cross-cutting methodological guide that addresses stage-specific challenges and evolves as new projects add new considerations
Specific investigations — each focused on a single real problem: the sampling issue on large datasets, the -1 outlier topic, annotation granularity, and blind vs review evaluation
Machine translation — the workflow for integrating MT into topic modelling pipelines and the metrics used to evaluate translation quality

Document Index

Each document can be read on its own. Cross-references point to related sections for readers who want to investigate a specific area in more depth.

General

Topic Allocation General Workflow	Overview of the end-to-end topic modelling workflow and the types of projects it supports — embedding, dimensionality reduction, clustering, annotation, and evaluation. The entry point for new projects.
Methodological Considerations	Cross-project guide segmented by pipeline stage. Addresses challenges and solutions that recur across projects: seed list completeness, tokenisation decisions, model parameter tradeoffs, annotation alignment, evaluation split design. An evolving document updated as new projects contribute new lessons.

Specific Investigations

Sampling Bug	The specific failure mode that emerges when the topic modelling sampling approach is applied to large datasets. Documents the problem, the conditions under which it surfaces, and the solutions that resolve it without degrading cluster quality.
Outlier Mitigation	Investigation into the HDBSCAN -1 (outlier) cluster, which consistently captures 40–60% of the data. Compares soft clustering membership vectors and k-means reassignment as recovery strategies.
Granular Annotation Scheme	A proposed refinement to the standard topic annotation approach — entropy-based stopping criteria and proportional sampling per message rather than fixed-sample characterisation.
Blind vs Review Evaluation	Comparison of blind and review evaluation methods for thematic allocation — when each is appropriate, how anchoring bias emerges in review evaluation, and a hybrid approach that splits the evaluation sample.
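
As a rough illustration of the recovery idea described under Outlier Mitigation, the sketch below moves each point labelled -1 to the cluster with the nearest centroid. It is a simplified stand-in, not the method documented in the investigation: it assumes embeddings and cluster labels are already available as plain Python lists, and the function name reassign_outliers and the max_distance cutoff are hypothetical.

```python
import math
from collections import defaultdict


def reassign_outliers(vectors, labels, max_distance=None):
    """Assign each -1 point to the cluster with the nearest centroid.

    vectors: list of equal-length lists of floats (e.g. reduced embeddings).
    labels: list of ints, with -1 marking outliers (the HDBSCAN convention).
    max_distance: hypothetical cutoff; outliers farther than this stay -1.
    """
    # Group the non-outlier points by cluster label.
    members = defaultdict(list)
    for vec, label in zip(vectors, labels):
        if label != -1:
            members[label].append(vec)
    if not members:
        return list(labels)  # no clusters to reassign to

    # Centroid = per-dimension mean of each cluster's members.
    centroids = {
        label: [sum(dim) / len(vecs) for dim in zip(*vecs)]
        for label, vecs in members.items()
    }

    new_labels = []
    for vec, label in zip(vectors, labels):
        if label != -1:
            new_labels.append(label)
            continue
        # Find the closest centroid by Euclidean distance.
        nearest, dist = min(
            ((lab, math.dist(vec, c)) for lab, c in centroids.items()),
            key=lambda pair: pair[1],
        )
        keep_outlier = max_distance is not None and dist > max_distance
        new_labels.append(-1 if keep_outlier else nearest)
    return new_labels


vectors = [[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.2, 5.0], [0.1, 0.3], [20.0, 20.0]]
labels = [0, 0, 1, 1, -1, -1]
relabelled = reassign_outliers(vectors, labels, max_distance=3.0)
```

With the cutoff set, the point near cluster 0 is recovered while the distant point stays in -1. Keeping a cutoff is one way to avoid forcing genuinely off-topic documents into a cluster and degrading its coherence.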

Machine Translation

Machine Translation Workflow	How machine translation integrates with the topic modelling pipeline — when to translate before clustering vs after, the impact of translation quality on cluster coherence, and the operational workflow.
Machine Translation Metrics	Overview of applicable MT evaluation metrics — BERTScore as the primary quality signal, METEOR, and a structured error analysis covering named entity preservation, domain shift, and morphologically rich languages.
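
To make the named entity preservation part of the error analysis more concrete, here is a minimal sketch that measures what share of expected entity strings survive verbatim in a translation. The function name entity_preservation_rate and the exact-substring matching rule are assumptions for illustration; the Machine Translation Metrics document may define preservation differently.

```python
def entity_preservation_rate(entities, translation):
    """Share of expected entity strings found verbatim in the translation.

    Matching is case-insensitive substring matching, which is deliberately
    crude: translated or transliterated names count as not preserved.
    """
    if not entities:
        return 1.0  # nothing to preserve
    text = translation.casefold()
    kept = [e for e in entities if e.casefold() in text]
    return len(kept) / len(entities)


rate = entity_preservation_rate(
    ["Siemens", "München", "Reuters"],
    "Reuters reported that Siemens opened an office in Munich.",
)
```

In this example "München" is counted as lost because the translation uses the English exonym "Munich". That may or may not be an error for a given project, so a check like this is better treated as a flag for manual review than as a score.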

Project Checklist

A stage-by-stage quality gate distilled from recurring mistakes across projects. Each stage maps to the corresponding section of the Methodological Considerations document.

1 · Data Collection

Seed lists completed before collection starts
Time frame fixed and agreed
No missing data in the final corpus

2 · Preprocessing

Dataset size assessed — one model or split required?
Tokenisation impact checked on the data
Data divided into appropriate analytical units

3 · Model Creation

Embedding model chosen for the language and domain
Number-of-topics vs -1 tradeoff understood and tuned
Reproducibility measures in place (seeds, config versioning)
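
One possible way to put the reproducibility item into practice is to keep run parameters in a single frozen config object, derive a short fingerprint from it for tagging outputs, and route all sampling through a seeded random number generator. The sketch below uses only the standard library; RunConfig, its fields, and the default values are hypothetical and not settings from any project.

```python
import hashlib
import json
import random
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunConfig:
    # All field names and defaults are illustrative, not project settings.
    embedding_model: str = "example-embedding-model"
    min_cluster_size: int = 50
    n_components: int = 5
    seed: int = 42

    def fingerprint(self) -> str:
        """Short stable hash of the config, suitable for tagging outputs."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]


def seeded_sample(items, k, seed):
    """Draw a sample that is identical on every run with the same seed."""
    rng = random.Random(seed)  # local RNG, leaves global random state untouched
    return rng.sample(list(items), k)


config = RunConfig()
sample_a = seeded_sample(range(1000), 5, config.seed)
sample_b = seeded_sample(range(1000), 5, config.seed)
```

The same seed would also need to be passed to any stochastic pipeline step, such as dimensionality reduction, for a full run to be repeatable.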

4 · Annotation Scheme

Scheme aligned with project methodology and goals
Regular analyst–annotator interaction scheduled
Qualitative vs quantitative end product clarified

5 · Evaluation

Annotation split into validation stage and test stage
Blind and review evaluation roles defined before annotation begins
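
The hybrid approach mentioned in the Blind vs Review Evaluation entry splits the evaluation sample between the two methods. A minimal sketch of such a split is shown below, assuming documents are identified by sortable IDs. The function name split_evaluation_sample and the 50/50 default are hypothetical, not values taken from the investigation.

```python
import random


def split_evaluation_sample(doc_ids, blind_fraction=0.5, seed=0):
    """Split an evaluation sample into disjoint blind and review subsets.

    blind_fraction is an illustrative default, not a recommended value.
    """
    if not 0.0 <= blind_fraction <= 1.0:
        raise ValueError("blind_fraction must be between 0 and 1")
    ids = sorted(set(doc_ids))  # deduplicate and fix the order before shuffling
    rng = random.Random(seed)   # seeded so the split can be reproduced
    rng.shuffle(ids)
    cut = round(len(ids) * blind_fraction)
    return {"blind": ids[:cut], "review": ids[cut:]}


split = split_evaluation_sample(range(100), blind_fraction=0.5, seed=7)
```

Fixing the split with a seed before annotation begins means the assignment of documents to each evaluation mode cannot be influenced by early annotation results.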

6 · Analysis

All analytical claims backed by evaluation results
Insights scoped to what the data and annotation strategy can actually support

Tech Stack