
What This Is

Documentation as a Deliverable

Running topic modelling in production across multiple projects, languages, and client contexts surfaces a consistent set of problems — many of them non-obvious the first time you encounter them. This repo is the written record of those problems and the solutions that held up across projects.

It is not a library. It is a structured body of documentation: methodological guides, decision frameworks, investigation write-ups, and a stage-by-stage checklist. Each document is self-contained and cross-referenced, so a new project can start from the relevant section without re-reading everything else.

The documents were originally internal working documents, revised and extended as new projects introduced new problems. The repo makes them available as a reference for anyone doing similar work.

Scope

What It Covers

The documentation spans the full topic modelling workflow — from data collection decisions through to analysis — and covers the integration of machine translation for multilingual projects. It is organised into three tiers:

  • General documents — the overall workflow and a cross-cutting methodological guide that addresses stage-specific challenges and evolves as new projects add new considerations
  • Specific investigations — each focused on a single real problem: the sampling issue on large datasets, the -1 outlier topic, annotation granularity, and blind vs review evaluation
  • Machine translation — the workflow for integrating MT into topic modelling pipelines and the metrics used to evaluate translation quality

Document Index

Each document is self-contained. Cross-references within each document point to related sections for readers who want to go deeper into a specific area.

General

Topic Allocation General Workflow: Overview of the end-to-end topic modelling workflow (embedding, dimensionality reduction, clustering, annotation, and evaluation) and the types of projects it supports. The entry point for new projects.
Methodological Considerations: Cross-project guide segmented by pipeline stage. Addresses challenges and solutions that recur across projects, including seed list completeness, tokenisation decisions, model parameter tradeoffs, annotation alignment, and evaluation split design. An evolving document, updated as new projects contribute new lessons.

Specific Investigations

Sampling Bug: The specific failure mode that emerges when the topic modelling sampling approach is applied to large datasets. Documents the problem, the conditions under which it surfaces, and the solutions that resolve it without degrading cluster quality.
Outlier Mitigation: Investigation into the HDBSCAN -1 cluster, which consistently captures 40–60% of data. Compares soft clustering membership vectors and k-means reassignment as recovery strategies.
Granular Annotation Scheme: A proposed refinement to the standard topic annotation approach, using entropy-based stopping criteria and proportional sampling per message rather than fixed-sample characterisation.
Blind vs Review Evaluation: Comparison of blind and review evaluation methods for thematic allocation. Covers when each is appropriate, how anchoring bias emerges in review evaluation, and a hybrid approach that splits the evaluation sample.
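
To make the Outlier Mitigation entry more concrete, here is a minimal sketch of reassigning -1 points to existing clusters. It assumes a plain nearest-centroid rule as a simplified stand-in for the k-means reassignment compared in the investigation; the function name `reassign_outliers` and the toy data are hypothetical and are not taken from the investigation itself.

```python
import math
from collections import defaultdict


def reassign_outliers(points, labels, outlier_label=-1):
    """Assign each outlier to the cluster with the nearest centroid.

    Simplified stand-in for a k-means style reassignment: centroids are
    computed once from the non-outlier points and are never updated.
    """
    groups = defaultdict(list)
    for point, label in zip(points, labels):
        if label != outlier_label:
            groups[label].append(point)
    if not groups:
        # No clusters to reassign to: return the labels unchanged.
        return list(labels)

    # Centroid of each cluster = per-dimension mean of its members.
    centroids = {
        label: tuple(sum(dim) / len(members) for dim in zip(*members))
        for label, members in groups.items()
    }

    new_labels = []
    for point, label in zip(points, labels):
        if label == outlier_label:
            label = min(centroids, key=lambda c: math.dist(point, centroids[c]))
        new_labels.append(label)
    return new_labels


# Toy 2-D "embeddings" (illustrative values only).
points = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (0.3, 0.3), (4.0, 4.5)]
labels = [0, 0, 1, 1, -1, -1]

outlier_share = labels.count(-1) / len(labels)  # share of data in the -1 cluster
new_labels = reassign_outliers(points, labels)
```

Forcing every outlier into a cluster trades coverage for purity, so a distance threshold that leaves far-away points in -1 may be worth adding in practice.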

Machine Translation

Machine Translation Workflow: How machine translation integrates with the topic modelling pipeline. Covers when to translate before clustering vs after, the impact of translation quality on cluster coherence, and the operational workflow.
Machine Translation Metrics: Overview of applicable MT evaluation metrics. Covers BERTScore as the primary quality signal, METEOR, and a structured error analysis of named entity preservation, domain shift, and morphologically rich languages.
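
As an illustration of the named entity preservation check mentioned under Machine Translation Metrics, the sketch below computes the share of source entities that survive verbatim in a translation. It assumes the entities have already been extracted by a separate NER step (not shown) and uses case-insensitive substring matching, which will miss transliterated or inflected forms. The function name and the example sentence are hypothetical, and this is not necessarily how the error analysis in the document is carried out.

```python
def entity_preservation_rate(source_entities, translation):
    """Share of source entities found verbatim (case-insensitive) in the translation."""
    if not source_entities:
        return 1.0  # Nothing to preserve, so nothing was lost.
    haystack = translation.casefold()
    kept = sum(1 for entity in source_entities if entity.casefold() in haystack)
    return kept / len(source_entities)


# "Bundestag" was translated away as "the parliament", so only one of the
# two entities is preserved.
rate = entity_preservation_rate(
    ["Angela Merkel", "Bundestag"],
    "Angela Merkel addressed the parliament on Tuesday.",
)
```

A low rate flags segments for manual review; it does not by itself say whether the translation is wrong, since some entities are legitimately translated.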

Project Checklist

A stage-by-stage quality gate distilled from recurring mistakes across projects. Each stage maps to the corresponding section of the Methodological Considerations document.

1 · Data Collection

  • Seed lists completed before collection starts
  • Time frame fixed and agreed
  • No missing data in the final corpus

2 · Preprocessing

  • Dataset size assessed — one model or split required?
  • Tokenisation impact checked on the data
  • Data divided into appropriate analytical units

3 · Model Creation

  • Embedding model chosen for the language and domain
  • Number-of-topics vs -1 tradeoff understood and tuned
  • Reproducibility measures in place (seeds, config versioning)
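
One possible way to approach the config versioning item above is to store a fingerprint of the run configuration next to every model artefact. The sketch below is a minimal example under assumptions: the config layout, the `config_fingerprint` helper, and the embedding model name are hypothetical, and the parameter values are illustrative only.

```python
import hashlib
import json


def config_fingerprint(config):
    """Return a short, stable hash of a JSON-serialisable config dict."""
    # Sorting keys makes the hash independent of dict insertion order.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


run_config = {
    "seed": 42,
    "embedding_model": "example-multilingual-model",  # hypothetical name
    "umap": {"n_neighbors": 15, "n_components": 5},
    "hdbscan": {"min_cluster_size": 50},
}

# Save this string alongside the model outputs and annotation exports.
fingerprint = config_fingerprint(run_config)
```

Two runs with the same fingerprint used the same settings, which makes it easier to tell whether a change in topics comes from the data or from the configuration.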

4 · Annotation Scheme

  • Scheme aligned with project methodology and goals
  • Regular analyst–annotator interaction scheduled
  • Qualitative vs quantitative end product clarified
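
For this stage, the Granular Annotation Scheme investigation proposes entropy-based stopping criteria. The sketch below shows one way such a criterion could look, under stated assumptions: annotation stops for a topic once another batch of labels changes the Shannon entropy of the label distribution by less than a tolerance. The batch size, the tolerance, and both function names are hypothetical, and the proportional sampling part of the proposal is not covered here.

```python
import math
from collections import Counter


def shannon_entropy(labels):
    """Shannon entropy (in bits) of the empirical label distribution."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def annotations_needed(labels, batch_size=5, tolerance=0.05):
    """Return how many annotations were read before entropy stabilised.

    Hypothetical rule: stop once adding another batch changes the entropy
    by less than `tolerance` bits. Returns len(labels) if it never settles.
    """
    previous = None
    for end in range(batch_size, len(labels) + 1, batch_size):
        current = shannon_entropy(labels[:end])
        if previous is not None and abs(current - previous) < tolerance:
            return end
        previous = current
    return len(labels)
```

A topic whose messages all receive the same label settles after the second batch, while a mixed topic keeps drawing annotations until its label distribution stops moving.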

5 · Evaluation

  • Annotation split into validation stage and test stage
  • Blind and review evaluation roles defined before annotation begins

6 · Analysis

  • All analytical claims backed by evaluation results
  • Insights scoped to what the data and annotation strategy can actually support

Tech Stack

Python, BERTopic, sentence-transformers, UMAP, HDBSCAN, mBART, BERTScore, seqeval, Jupyter