
The Problem

Heterogeneous Topics

A single-pass topic model produces topics at a level of granularity determined by the data density and the HDBSCAN parameters. In practice, some of these topics are analytically heterogeneous: each forms a semantically coherent cluster from HDBSCAN's perspective, but from an analyst's perspective its messages cover multiple distinct sub-narratives that should be reported separately.

The traditional response to this is filtering: applying keyword-based classifiers or ML classifiers to subdivide the heterogeneous topic. This works, but it requires defining the classifier before you know exactly what the sub-narratives are — and that definition typically needs to be done by someone with domain knowledge who has already looked at the data. It also locks the subdivision into the categories chosen at the point of classifier definition, which may not match what actually emerges when the sub-narratives are examined closely.

Multilayered topic modelling addresses this by running a second clustering pass on the messages within a heterogeneous topic, producing sub-topics from the data rather than from predefined categories. The sub-narratives are discovered, not declared. This preserves the exploratory quality of the original pipeline while still achieving the analytical granularity the project needs.

Approach

The Layered Pipeline

Layer 1 — Initial exploration: A standard topic modelling run across the full corpus. This produces a broad thematic breakdown. Topics are annotated, described, and organised into a preliminary theme map. Topics flagged as heterogeneous — containing multiple distinct sub-narratives — are candidates for a second layer.

Layer 2 — Refinement: Each heterogeneous topic is modelled independently. The messages from that topic are extracted and a new clustering run is applied, typically with adjusted UMAP and HDBSCAN parameters suited to the smaller corpus. This produces sub-topics specific to that topic's content. Annotators describe and categorise these sub-topics, and the results are integrated back into the overall theme map as a finer-grained branch.

Layer 3 (optional): If a Layer 2 sub-topic is itself heterogeneous, the process can be applied again. In practice, three layers is typically the maximum before the sub-corpora become too small for stable clustering.

Integration: Each layer produces its own annotation output. The annotation schema must be designed to bridge across layers — each message receives an annotation at every layer it appears in, and the final topic assignment reflects the deepest layer at which it was assigned a stable description. This requires careful schema design to avoid ambiguity when messages appear at multiple levels of the hierarchy.
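
One way to express the "deepest stable layer" rule is to store one annotation per layer for each message and resolve the final assignment by walking up from the deepest layer. The sketch below assumes a hypothetical schema in which each layer's annotation is either a label string or `None` (no stable description at that layer, for example because the message fell into noise). The names `MessageAnnotation`, `final_assignment`, and `path` are illustrative and not taken from the original pipeline.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class MessageAnnotation:
    """Annotations for one message, ordered from Layer 1 downwards."""
    message_id: str
    # layers[0] is the Layer 1 label, layers[1] the Layer 2 label, and so on.
    # None means the message has no stable description at that layer.
    layers: list[Optional[str]] = field(default_factory=list)

    def final_assignment(self) -> tuple[int, Optional[str]]:
        """Return (layer number, label) for the deepest layer with a label."""
        for depth in range(len(self.layers), 0, -1):
            label = self.layers[depth - 1]
            if label is not None:
                return depth, label
        return 0, None  # never assigned at any layer

    def path(self) -> str:
        """Hierarchy path down to the final assignment, e.g. 'A > B'."""
        depth, _ = self.final_assignment()
        return " > ".join(lab for lab in self.layers[:depth] if lab is not None)


annotations = [
    MessageAnnotation("m1", ["Topic 3", "Sub-topic 3b"]),
    MessageAnnotation("m2", ["Topic 3", None]),  # no stable Layer 2 label
    MessageAnnotation("m3", ["Topic 5"]),        # topic was never refined
]
for ann in annotations:
    print(ann.message_id, ann.final_assignment(), ann.path())
```

Keeping every layer's label on the record, rather than overwriting the Layer 1 label with the Layer 2 one, is what lets a message fall back to its parent topic when the deeper pass leaves it unassigned.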

Semantic maps: Each layer maintains its own UMAP projection. Mixing semantic maps across layers produces misleading visualisations — the UMAP coordinates from a full-corpus run and a sub-corpus run are not comparable. Keeping them separate preserves the interpretability of each layer's visual output.

vs. Classifiers

Why Not Just Use Classifiers?

Keyword and ML classifiers are the conventional tool for refining topic content. They work — and in some projects, particularly quantitative ones with well-defined categories, they remain the right choice. But they have constraints that multilayered topic modelling avoids.

Category definition: A classifier requires you to define the categories before you see the sub-narratives clearly. In practice, defining good classifier categories requires reviewing the data — which means you are partly doing the analytical work before the tool is even built. Multilayered topic modelling does the analytical work directly, using the data distribution itself to define the sub-categories.

Flexibility: A classifier trained on predefined categories cannot surface a sub-narrative that was not anticipated when the categories were defined. Topic modelling at Layer 2 can. If an unexpected sub-narrative exists in the data at sufficient density, it will form a cluster.

Complementarity: The approaches are not mutually exclusive. In quantitative projects, classifiers are often used at the theme level for precision — zero-shot classification, keyword filtering, or guided topic modelling — while multilayered topic modelling handles the sub-theme level where the category structure is less well-defined. The choice depends on how well-specified the analytical framework is at each level of the hierarchy.

Challenges

Annotation Schema & Practical Constraints

Schema design: The primary challenge is building an annotation schema that works consistently across layers. A message that appears in Layer 1 Topic 3 and Layer 2 Sub-topic 3b needs to carry both annotations cleanly. The schema must be agreed before Layer 2 annotation begins, because changing it mid-process means re-annotating at least part of the output already produced.

Annotator load: Each additional layer adds annotation work. The messages from a heterogeneous topic must be described and categorised again at the sub-topic level, even if they were already annotated at Layer 1. Careful selection of which topics to refine — based on heterogeneity scores and analytical priority — is essential to keep the annotation overhead manageable.

Cluster stability: Sub-corpora are smaller than the full corpus, and small corpora can produce unstable clusters — topics that shift substantially with small parameter changes. More conservative HDBSCAN parameters (larger minimum cluster sizes) are typically needed at Layer 2 to prevent fragmentation into clusters that are too small to annotate reliably.

Reporting: The output of a multilayered run is a hierarchical topic structure rather than a flat list. Reporting tools and client-facing outputs need to reflect this hierarchy clearly, or the additional analytical depth is lost in presentation.

Applied in

Applied in multiple projects where initial topic models produced thematically broad clusters that required finer analytical breakdown. Particularly effective in large corpora where a single-pass model captures macro-level themes but misses the sub-narrative structure that clients need for reporting.

Tech Stack

Python, sentence-transformers, UMAP, HDBSCAN, scikit-learn, Pandas