Granular Annotation Scheme


The Problem

Limits of Fixed-Sample Annotation

The standard approach to topic annotation selects a fixed number of messages per topic — typically 10 for quantitative projects, 5 for exploratory ones — and asks an analyst to characterise each topic from that sample. This fixed-sample approach has worked well across multiple projects, but it has structural weaknesses that become visible at more granular levels of analysis.

The first is methodological: a fixed sample size treats every topic as equally well represented, regardless of how large it is. A topic with 50 messages and one with 5,000 messages both receive the same 10-message sample, which covers 20% of the first and 0.2% of the second. This is hard to justify when the goal is to make quantitative claims about sub-theme volumes. When clients or reviewers ask why 10 was chosen, the honest answer is that it worked, not that it was derived from principled sampling theory.

The second is analytical: the analyst must characterise the topic from the sample as a whole, balancing homogeneity, analytical relevance, and any noise in a single judgement. In practice, topics are not uniformly homogeneous — some have a dominant narrative with minor peripheral messages, others are genuinely mixed. A fixed-sample approach does not surface this distinction; it surfaces whatever happened to be in the 10-message draw.

Performance degradation at sub-theme level — consistently observed across projects — is a direct consequence: the topic description is approximated from a small sample, and that approximation loses accuracy as the analytical unit gets smaller.

Approach

Granular Annotation Procedure

Sampling strategy: Rather than a fixed count, the sample size is proportional to the topic — approximately 10% of the topic's total messages, subject to minimum and maximum bounds. This aligns the annotation effort with the topic's size and produces a sample that can be more robustly defended as representative.
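A minimal sketch of the proportional sampling rule follows. The 10% proportion comes from the text above; the bounds `min_size=5` and `max_size=200` and the function name are hypothetical placeholders, since the actual minimum and maximum are not stated here.

```python
import math

def proportional_sample_size(topic_size, proportion=0.10, min_size=5, max_size=200):
    """Return how many messages to sample from a topic.

    proportion=0.10 follows the text; min_size and max_size are
    assumed values chosen only for illustration.
    """
    if topic_size <= 0:
        return 0
    # Round first so floating-point noise (e.g. 3.0000000000000004)
    # does not push the ceiling up by one.
    target = math.ceil(round(topic_size * proportion, 9))
    bounded = max(min_size, min(target, max_size))
    # A topic smaller than the minimum is simply reviewed in full.
    return min(bounded, topic_size)

for size in (3, 50, 1000, 5000):
    print(size, proportional_sample_size(size))
```

With these assumed bounds, a 50-message topic yields a sample of 5, while a 5,000-message topic is capped at 200 instead of 500.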

Annotation strategy: Instead of characterising the topic as a whole from a batch of messages, the analyst reviews messages individually. Each message receives a single short description. A description can apply to multiple messages if they share the same content, but the annotation decision is made at the message level rather than the topic level. Once the annotation pass is complete, the most frequently occurring description is designated as the overall topic description.

This unpacks a cognitive task that the fixed-sample approach collapses together: rather than holding homogeneity, relevance, and topic characterisation in mind simultaneously, the analyst makes one decision per message — "what is this about?" — and the topic description emerges from the aggregation.
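The aggregation step can be sketched as a simple frequency count over per-message descriptions. The function name, the returned fields, and the example descriptions are all hypothetical; the only behaviour taken from the text is that the most frequent description becomes the topic description.

```python
from collections import Counter

def summarise_topic(descriptions):
    """Aggregate per-message descriptions into a topic-level summary."""
    if not descriptions:
        raise ValueError("at least one description is required")
    counts = Counter(descriptions)
    # The most frequent description is designated the topic description.
    dominant, dominant_count = counts.most_common(1)[0]
    return {
        "topic_description": dominant,
        "dominant_share": dominant_count / len(descriptions),
        "counts": dict(counts),
    }

# Illustrative descriptions only, not real project data.
example = ["delivery delays", "delivery delays", "refund request", "delivery delays"]
print(summarise_topic(example))
```

The sketch does not treat ties specially; a real workflow would probably want an explicit rule for topics where two descriptions are equally frequent.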

Outcome: Topics are categorised as either homogeneous (the dominant description accounts for most messages) or heterogeneous (no clear dominant description). Heterogeneous topics are candidates for further treatment — multilayered topic modelling, keyword filtering, or explicit splitting — rather than being forced into a single characterisation that poorly fits their content.

Stopping Criteria

Entropy-Based Early Stopping

For small topics (those whose sample contains fewer than 20 messages), all sampled messages are reviewed and no stopping criterion applies. For larger samples, annotation proceeds in batches of 10 messages, and entropy is recalculated after each batch to decide whether to continue.

Entropy calculation: Given the descriptions assigned so far, Shannon entropy measures how evenly messages are spread across descriptions. A topic where every message receives the same description has an entropy of zero and is perfectly homogeneous. A topic whose messages are spread uniformly across descriptions has maximum entropy and is completely heterogeneous. Entropy is normalised by dividing it by the maximum possible entropy for the number of unique descriptions observed, and a topic is considered heterogeneous if this normalised entropy exceeds 0.5.
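The entropy calculation described above can be sketched with the standard library alone. The function names are hypothetical, and returning 0.0 when only one unique description exists (where the maximum entropy is itself zero) is an assumption of this sketch, not a documented rule.

```python
import math
from collections import Counter

def normalised_entropy(descriptions):
    """Shannon entropy of the description distribution, scaled to [0, 1]."""
    counts = Counter(descriptions)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("at least one description is required")
    unique = len(counts)
    if unique == 1:
        # Assumed convention: a single description is perfectly homogeneous.
        return 0.0
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    # Maximum entropy for `unique` descriptions is log2(unique).
    return entropy / math.log2(unique)

def is_heterogeneous(descriptions, threshold=0.5):
    """Apply the 0.5 normalised-entropy threshold from the text."""
    return normalised_entropy(descriptions) > threshold

print(normalised_entropy(["a"] * 10))          # one description only
print(normalised_entropy(["a"] * 5 + ["b"] * 5))  # evenly split
print(is_heterogeneous(["a"] * 9 + ["b"]))     # 9:1 split
```

Under this normalisation, a 9:1 split between two descriptions scores roughly 0.47 and stays just below the threshold, while an even split scores 1.0.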

Annotation patience: The key hyperparameter is annotation patience — the number of consecutive stable batches required before early stopping triggers. If entropy stabilises across iterations (the distribution of descriptions is not changing as more messages are reviewed), annotation stops after reaching the patience threshold. If entropy fluctuates — new descriptions keep appearing — annotation extends by one additional batch beyond the patience threshold before stopping. This prevents premature stopping on topics that appear stable early but have structural complexity deeper in the sample.
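One possible reading of the stopping rule is sketched below. Several details are assumptions because the text does not fix them: a batch counts as "stable" when normalised entropy moves by no more than a hypothetical `tolerance` of 0.05 from the previous batch, `patience` defaults to a hypothetical 2, and "extends by one additional batch" is interpreted as requiring `patience + 1` consecutive stable batches once any fluctuation has been seen. The `annotate` callable stands in for the analyst.

```python
import math
from collections import Counter

def normalised_entropy(descriptions):
    """Shannon entropy scaled by the maximum for the unique descriptions seen."""
    counts = Counter(descriptions)
    total = sum(counts.values())
    if len(counts) <= 1:
        return 0.0
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return entropy / math.log2(len(counts))

def annotate_with_early_stopping(sample, annotate, batch_size=10,
                                 patience=2, tolerance=0.05, small_sample=20):
    """Annotate `sample` in batches, stopping once entropy is stable.

    patience, tolerance, and the fluctuation rule are assumed values
    and interpretations, not confirmed settings.
    """
    # Small samples are reviewed in full, with no stopping criterion.
    if len(sample) < small_sample:
        return [annotate(message) for message in sample]

    descriptions = []
    previous = None
    stable_run = 0
    fluctuated = False
    for start in range(0, len(sample), batch_size):
        batch = sample[start:start + batch_size]
        descriptions.extend(annotate(message) for message in batch)
        current = normalised_entropy(descriptions)
        if previous is not None:
            if abs(current - previous) <= tolerance:
                stable_run += 1
            else:
                stable_run = 0
                fluctuated = True
        previous = current
        # After any fluctuation, require one extra stable batch.
        required = patience + 1 if fluctuated else patience
        if stable_run >= required:
            break
    return descriptions

# A perfectly homogeneous topic stops after the first batch plus two stable ones.
done = annotate_with_early_stopping(list(range(200)), lambda message: "same theme")
print(len(done))
```

Other definitions of stability are equally plausible, for example tracking whether new descriptions appeared in the latest batch; the structure of the loop would stay the same.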

Rationale: The stopping criterion makes the annotation process empirically grounded rather than time-bounded. Rather than stopping when the analyst is tired or when a fixed count is reached, stopping is tied to information-theoretic stability in the emerging topic description.

Tradeoffs

Advantages & Limitations

Advantages:

Proportional sampling is methodologically defensible in a way that fixed sample sizes are not: the sample size scales with the topic, making representativeness claims more robust.

Per-message annotation produces richer topic definitions. When topics are subsequently used to generate supervised training data (e.g. for a classifier), the definition is already at message level rather than needing to be reverse-engineered from a topic-level characterisation.

Topics from different models or pipeline runs can be more easily merged, since each is defined by individual message descriptions rather than a holistic characterisation that may not transfer.

Evaluation is more tractable than in previous rounds: because topic homogeneity is already known from the annotation pass, the evaluation phase can be streamlined rather than requiring multiple iterations.

Limitations:

Initial annotation takes longer than fixed-sample approaches. The time saving comes in evaluation — fewer iterations are needed — but the upfront cost is higher and harder to scope without knowing the entropy distribution in advance.

The hyperparameters (sample proportion, patience threshold, entropy acceptance threshold) require empirical validation. Current values are based on experience across a limited number of projects and will need refinement as more data accumulates. A regression model that predicts expected annotation accuracy from topic entropy is a natural future development but requires more project runs to train.

The scheme requires an interactive tool to implement effectively — annotators need to see messages, assign descriptions, and monitor entropy values in real time. Manual spreadsheet-based implementation is feasible but cumbersome for large topics.

Entropy is sensitive to annotation quality: it is calculated over the descriptions analysts write, so inconsistent phrasing — two annotators describing the same content differently, or the same annotator varying their wording across a session — artificially inflates entropy and can misclassify a homogeneous topic as heterogeneous. Ensuring consistent writing conventions, and potentially applying a normalisation or deduplication layer to descriptions before entropy is calculated, is a prerequisite for reliable stopping decisions.
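A normalisation layer of the kind mentioned above might look like the following sketch. The specific rules (Unicode normalisation, case folding, punctuation removal, whitespace collapsing) are assumptions about what such a layer could do, not a description of an existing implementation.

```python
import re
import unicodedata

def normalise_description(text):
    """Reduce surface variation in a description before counting it."""
    # Unify Unicode forms and ignore case differences.
    text = unicodedata.normalize("NFKC", text).casefold()
    # Replace punctuation with spaces so "delivery-delays" matches "delivery delays".
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse runs of whitespace and trim the ends.
    return " ".join(text.split())

raw = ["Delivery delays.", "delivery  delays", "DELIVERY-DELAYS", "Refund request"]
print(sorted({normalise_description(d) for d in raw}))
```

This only catches surface-level differences. Two descriptions that paraphrase each other (for example "late deliveries" and "delivery delays") would still count as distinct and would need a controlled vocabulary or manual merging.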

Applied in

Developed as a proposed replacement for the fixed-sample annotation workflow used in quantitative topic modelling projects. Designed to address performance degradation at sub-theme level and to provide a more defensible sampling methodology for client-facing analysis.

Related Methods

Topic Allocation Workflow
Outlier Mitigation in HDBSCAN

Tech Stack

Python, NumPy, Pandas, SciPy, Shannon Entropy