Topic Model Evaluation

Overview

Purpose & Challenge

Evaluation in topic modelling is not a single task — it serves two distinct purposes simultaneously. The first is performance measurement: assessing precision and recall of the thematic allocation, confirming that messages are correctly assigned to their themes and sub-themes. The second is review: giving analysts a structured pass through the data to identify labelling inconsistencies, noise in the annotations, and unstable thematic categories that need corrective action.

These two purposes are in tension. A rigorous evaluation designed purely for performance measurement would be conducted blind: analysts assess the data without seeing the topic model's own annotations, which removes the risk that they anchor to the model's output rather than forming independent judgements. But a fully blind evaluation forgoes the review function, so analysts must go through the same sample more than once: first to evaluate, and again to identify what needs fixing.

The evaluation can be run at two hierarchical levels — theme and sub-theme — and with two sampling strategies: general sampling when themes are balanced, and stratified sampling when specific sub-themes are underrepresented but analytically important. The choice between these is typically determined at the project scoping stage based on the analytical objectives.

Modes

Review vs Blind Evaluation

Review mode: The sample presented to analysts includes the topic model annotations — the theme and sub-theme assigned to each message. Analysts agree or disagree with each assignment, flagging misallocations. Because analysts can see the model's output, they develop a clear picture of where the thematic breakdown is solid and where it is noisy, and they can immediately identify what corrective action is needed. This was the standard approach in the majority of projects.

Blind mode: Annotations are hidden. Analysts read the message and independently assign a theme and sub-theme, then their assignments are compared against the model's. This eliminates anchoring bias — analysts cannot defer to the model's output. However, if performance is low, analysts must then review the same sample a second time to identify the specific misallocations and understand what went wrong. In practice this means going through the same sample two or three times before corrective action can be taken.
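As an illustration of the comparison step in blind mode, the sketch below scores the model's assignments against the analyst's independent ones, treating the analyst labels as the reference. The function name and the flat list-of-labels input are assumptions made for this example, not part of any project tooling.

```python
from collections import Counter


def per_theme_precision_recall(analyst_labels, model_labels):
    """Score model labels per theme, using blind analyst labels as the reference.

    Both arguments are equal-length lists of theme (or sub-theme) names,
    one entry per message. This input layout is an assumption of the sketch.
    """
    if len(analyst_labels) != len(model_labels):
        raise ValueError("label lists must have the same length")

    agreed = Counter()         # messages where model and analyst chose the same theme
    model_count = Counter()    # how often the model assigned each theme
    analyst_count = Counter()  # how often the analyst assigned each theme

    for analyst, model in zip(analyst_labels, model_labels):
        model_count[model] += 1
        analyst_count[analyst] += 1
        if analyst == model:
            agreed[model] += 1

    scores = {}
    for theme in set(model_count) | set(analyst_count):
        # Precision: of the messages the model put in this theme, how many the analyst agreed with.
        precision = agreed[theme] / model_count[theme] if model_count[theme] else None
        # Recall: of the messages the analyst put in this theme, how many the model found.
        recall = agreed[theme] / analyst_count[theme] if analyst_count[theme] else None
        scores[theme] = {"precision": precision, "recall": recall}
    return scores
```

A theme that only one side ever used gets `None` for the undefined metric rather than a misleading zero.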

When the bias concern emerged: During the BMJV project, concerns were raised that showing analysts the topic model annotations was influencing their evaluation judgements — they were more likely to agree with the model's assignment than to flag it as incorrect, inflating apparent precision. This prompted a reconsideration of review as the default approach. The tradeoff is real: review mode risks anchoring bias; blind mode risks analytical inefficiency.

Approach

Hybrid Evaluation

The proposed resolution is a hybrid approach that splits the evaluation sample into two subsets, rather than committing entirely to either review or blind evaluation.

Validation subset (review mode): A portion of the sample is shown to analysts with annotations visible. This is the review pass — analysts check for labelling inconsistencies, identify noise, and build their understanding of the topic structure. If performance on this subset is satisfactory, it indicates the thematic breakdown is robust enough to proceed to blind testing.

Test subset (blind mode): The remaining portion is evaluated without annotations. This is the performance measurement pass — producing an unbiased precision and recall estimate at theme or sub-theme level.

The key decision variable is analyst confidence in the topic theme map. If the map is clear and the topics are homogeneous and well characterised, moving to blind evaluation on the test subset is low-risk. If there is uncertainty about annotation robustness, increasing the size of the validation subset before moving to blind mode reduces the risk of a low-performance result that requires a full re-evaluation.

This hybrid approach is more efficient than a fully blind evaluation because the review pass means that corrective actions are identified concurrently with the evaluation, rather than requiring an additional pass through the data afterward.
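The split between the two subsets can be sketched as a single shuffled partition. This is a minimal sketch, assuming messages are identified by IDs and that the validation share is expressed as a fraction; the function name, the `validation_fraction` parameter and its default of 0.3 are illustrative choices, not values taken from any project.

```python
import random


def split_hybrid_sample(message_ids, validation_fraction=0.3, seed=0):
    """Partition an evaluation sample into a review subset and a blind subset.

    validation_fraction is the share shown with annotations (review mode);
    the remainder is evaluated blind. Raising it reflects lower confidence
    in the topic theme map. The 0.3 default is an arbitrary placeholder.
    """
    if not 0.0 <= validation_fraction <= 1.0:
        raise ValueError("validation_fraction must be between 0 and 1")

    shuffled = list(message_ids)
    # A seeded generator keeps the split reproducible across evaluation rounds.
    random.Random(seed).shuffle(shuffled)

    cut = round(len(shuffled) * validation_fraction)
    validation_subset = shuffled[:cut]  # annotations visible
    test_subset = shuffled[cut:]        # annotations hidden
    return validation_subset, test_subset
```

Because every message lands in exactly one subset, no message is reviewed with annotations visible and then scored blind.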

Sampling

General vs Stratified

General sampling draws a representative sample proportional to the theme and sub-theme distribution in the full annotated dataset. This is appropriate when themes are balanced — when no single theme dominates the corpus to the point that a proportional sample would underrepresent the rest. It produces a precision and recall estimate that reflects overall performance across the thematic breakdown.
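One way to draw such a proportional sample is to give each theme a quota matching its share of the corpus and then sample at random within each theme. The sketch below assumes records arrive as `(message_id, theme)` pairs and uses largest-remainder rounding so the quotas sum exactly to the requested size; both are choices made for this example.

```python
import random
from collections import defaultdict


def proportional_sample(records, sample_size, seed=0):
    """Sample message IDs per theme in proportion to each theme's corpus share.

    records is a list of (message_id, theme) pairs (an assumed layout).
    Returns a dict mapping theme -> list of sampled message IDs.
    """
    if sample_size > len(records):
        raise ValueError("sample_size exceeds the number of records")

    by_theme = defaultdict(list)
    for message_id, theme in records:
        by_theme[theme].append(message_id)

    total = len(records)
    # Exact fractional quota for each theme.
    quotas = {t: sample_size * len(ids) / total for t, ids in by_theme.items()}
    # Start from the rounded-down quota.
    allocation = {t: int(q) for t, q in quotas.items()}
    leftover = sample_size - sum(allocation.values())
    # Give the remaining slots to the themes with the largest fractional parts.
    by_remainder = sorted(quotas, key=lambda t: quotas[t] - allocation[t], reverse=True)
    for theme in by_remainder[:leftover]:
        allocation[theme] += 1

    rng = random.Random(seed)
    return {t: rng.sample(ids, allocation[t]) for t, ids in by_theme.items()}
```

With 80 messages in one theme and 20 in another, a sample of 10 would contain 8 and 2 respectively.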

Stratified sampling oversamples from specific themes or sub-themes — typically the smaller or less-represented ones. This is appropriate when the analytical focus of the project is on those underrepresented sub-themes, or when a balanced evaluation is needed for reporting purposes even if the underlying distribution is skewed. Stratified evaluation answers the question of whether the model performs consistently across the full thematic breakdown, not just well on the largest themes.
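A simple form of this oversampling is to draw the same number of messages from every sub-theme, capped at what each sub-theme actually contains. The sketch below assumes `(message_id, sub_theme)` pairs and an equal per-stratum quota; real projects may prefer different quotas per stratum.

```python
import random
from collections import defaultdict


def stratified_sample(records, per_stratum, seed=0):
    """Draw up to per_stratum message IDs from each sub-theme.

    records is a list of (message_id, sub_theme) pairs (an assumed layout).
    Small sub-themes are oversampled relative to their corpus share;
    a sub-theme with fewer messages than the quota contributes all of them.
    """
    by_stratum = defaultdict(list)
    for message_id, sub_theme in records:
        by_stratum[sub_theme].append(message_id)

    rng = random.Random(seed)
    sample = {}
    for sub_theme, ids in by_stratum.items():
        k = min(per_stratum, len(ids))  # cannot draw more than the stratum holds
        sample[sub_theme] = rng.sample(ids, k)
    return sample
```

Scores computed on a sample like this describe each sub-theme separately; an overall figure would need to be reweighted to the corpus distribution.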

The choice of sampling strategy should be agreed with analysts and, where relevant, with clients before evaluation begins. Changing the sampling strategy after initial results have been produced makes it difficult to compare across evaluation rounds.

Applied in

Applied across quantitative and qualitative topic modelling projects. The review vs blind bias concern was specifically surfaced during the BMJV project; the hybrid approach was developed as a response and applied in subsequent work.

Related Methods

Topic Allocation Workflow
Granular Annotation Scheme