Outlier Mitigation in HDBSCAN


The Problem

The -1 Cluster

BERTopic relies on HDBSCAN — a density-based clustering algorithm — to identify topic clusters. HDBSCAN groups messages into dense areas of the embedding space and assigns a -1 label to any message that does not fall within or close to a dense area. These are not random noise: they are fringe discussions that sit at the periphery of the data distribution, often capturing nuanced or interdisciplinary content.

The problem is scale. In practice, the -1 cluster consistently accounted for 40–60% of the data across multiple projects. Discarding it means discarding a substantial portion of the corpus. Retaining it as a single undifferentiated bucket is analytically useless — it is too heterogeneous to describe or assign to a theme.
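As a quick check on how large the problem is in a given run, the outlier share can be read straight off the label list. The snippet below is a minimal sketch: `outlier_share` is a hypothetical helper name, and the labels are toy values, not project data.

```python
from collections import Counter


def outlier_share(labels):
    """Return the fraction of messages carrying HDBSCAN's -1 label."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    return counts[-1] / len(labels)


# Toy labels for illustration only: 5 of the 10 messages are outliers.
toy_labels = [0, 1, -1, -1, 0, 2, -1, 1, -1, -1]
share = outlier_share(toy_labels)
print(f"-1 share: {share:.0%}")
```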

Three mitigation strategies were investigated: soft clustering, which uses HDBSCAN's own membership probability vectors to reassign fringe messages to nearby core topics; k-means, applied exclusively to the -1 cluster using the HDBSCAN topic count as k; and KNN classification, which uses annotated topic exemplars stored in a vector database to assign -1 messages at scale without requiring a separate fringe annotation pass. The first two were evaluated on South African economic discourse data; KNN classification was developed and validated in production on a 7.7M-post multilingual corpus.

Background

How HDBSCAN Assigns -1

HDBSCAN constructs a minimum spanning tree over the dataset, using mutual reachability distances that stretch the gaps around sparse points, and applies single-linkage clustering to it, condensing the resulting hierarchy to find stable dense regions. The key concept is density: clusters form where messages are tightly packed in the embedding space, regardless of shape. This flexibility is what makes HDBSCAN well-suited to topic modelling, where clusters are rarely spherical.

Every message in a BERTopic run is assigned a membership vector: a set of scores indicating how strongly it is associated with each discovered cluster. For core topic messages, the vector peaks sharply at their assigned topic. For -1 messages, the vector is typically diffuse, with low scores distributed across multiple topics, reflecting genuine semantic ambiguity or peripheral position.

Crucially, these membership scores are not probabilities in the traditional sense. They represent the stability of a message's cluster membership across levels of the condensed dendrogram. A high score means the message consistently appeared in the same cluster as density thresholds varied; a low score means it was borderline. The -1 label is assigned when no cluster claims the message with sufficient stability.

The membership vector is what makes soft clustering viable — it encodes information about which core topics a fringe message is most related to, even if it was not assigned to any of them.

Approach 1

Soft Clustering

Soft clustering uses the membership vector to assign each -1 message to one or more core topics. Rather than forcing a single hard assignment, a message can be associated with up to five topics if its membership scores exceed a specified threshold. This reflects the actual analytical situation — fringe discussions are often genuinely relevant to more than one theme.
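The thresholded multi-assignment described above can be sketched in a few lines. This is an illustration under assumptions rather than the project's implementation: `soft_assign` is a hypothetical name and the threshold of 0.1 is a placeholder. In the hdbscan library such vectors can be obtained with `hdbscan.all_points_membership_vectors` on a clusterer fitted with `prediction_data=True`; whether that was the route used here is not stated, so the sketch takes plain lists as input.

```python
def soft_assign(membership, threshold=0.1, max_topics=5):
    """Return up to `max_topics` topic ids whose score exceeds `threshold`.

    `membership` is one message's membership vector (one score per core
    topic). Topics are returned strongest first. The default threshold is
    an assumed placeholder, not a recommended value.
    """
    ranked = sorted(enumerate(membership), key=lambda pair: pair[1], reverse=True)
    return [topic for topic, score in ranked[:max_topics] if score > threshold]


# A fringe message related to two core topics (toy scores).
two_topic_vector = [0.05, 0.30, 0.02, 0.22, 0.01]
# A fully diffuse message: nothing clears the threshold, so it stays unassigned.
diffuse_vector = [0.04, 0.03, 0.05]

print(soft_assign(two_topic_vector))
print(soft_assign(diffuse_vector))
```

Capping the list at five and applying the threshold afterwards means a message with many weak associations is left unassigned rather than spread thinly across topics.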

The approach introduces what we term fringe topics: new topic-like groupings formed by messages that share a dominant core topic association despite not being dense enough to form a cluster in HDBSCAN.
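One way to picture how fringe topics arise is to group -1 messages by their dominant core topic. The sketch below assumes that grouping rule and a placeholder threshold of 0.1; `build_fringe_topics` and the message ids are hypothetical.

```python
from collections import defaultdict


def build_fringe_topics(membership_by_message, threshold=0.1):
    """Group -1 messages by the core topic with the highest membership score.

    Returns (fringe_topics, residual): a dict mapping core topic id to the
    message ids attached to it, and a list of message ids whose best score
    does not exceed the threshold. The threshold value is an assumption.
    """
    fringe_topics = defaultdict(list)
    residual = []
    for message_id, vector in membership_by_message.items():
        dominant = max(range(len(vector)), key=vector.__getitem__)
        if vector[dominant] > threshold:
            fringe_topics[dominant].append(message_id)
        else:
            residual.append(message_id)
    return dict(fringe_topics), residual


# Toy membership vectors for four -1 messages over three core topics.
toy_vectors = {
    "m1": [0.05, 0.30, 0.02],
    "m2": [0.01, 0.25, 0.20],
    "m3": [0.03, 0.02, 0.04],
    "m4": [0.40, 0.05, 0.01],
}
fringe_topics, residual = build_fringe_topics(toy_vectors)
print(fringe_topics, residual)
```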

Key findings: Fringe topics were analytically distinct from core topics. They captured more peripheral or nuanced discussions rather than duplicating what HDBSCAN had already found. Contrary to expectation, the majority of fringe topics aligned primarily with a single core topic rather than spanning several. The multiplicity predicted by the membership vector often did not materialise as analytically meaningful cross-topic content; domain experts could identify genuine cross-topic cases, but they were the exception rather than the norm.

Tradeoffs: Soft clustering provides interpretability advantages — working with membership vectors is far more tractable than tuning UMAP and HDBSCAN parameters to reduce the -1 rate. However, the soft clusterer itself requires parameter tuning, and the addition of fringe topics increases the annotation burden substantially. Each fringe topic still needs to be sampled and described by an analyst.

Approach 2

K-Means on the -1 Cluster

K-means is applied exclusively to the messages in the -1 cluster. The value of k is set to the number of dense topics identified by HDBSCAN — the reasoning being that the -1 cluster often reflects the same underlying structure as the core data, just at lower density. If HDBSCAN found 20 topics, the fringe content likely also spans roughly 20 themes.

Unlike HDBSCAN, k-means forces every message into a bucket — there is no residual -1 cluster. This is both its advantage and its limitation: full coverage, but at the cost of potentially coarse assignments for messages that are genuinely ambiguous.
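The procedure can be sketched as follows: count the core topics, select the -1 messages, and run k-means on that subset only. To keep the example dependency-free it uses a small plain-Python Lloyd's iteration with a deterministic farthest-point initialisation; that initialisation, the function names, and the toy 2-D embeddings are all assumptions for illustration. In practice a library implementation such as scikit-learn's `KMeans` would be the natural choice.

```python
import math


def kmeans(points, k, iterations=50):
    """Plain Lloyd's algorithm with farthest-point initialisation (assumed)."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next seed: the point farthest from its nearest existing centroid.
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    assignment = []
    for _ in range(iterations):
        new_assignment = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no message changed bucket
        assignment = new_assignment
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment


def cluster_outliers(embeddings, labels):
    """Run k-means on the -1 messages only, with k = number of core topics.

    Returns {message index: fringe cluster id}. Fringe ids are independent
    of the core topic ids; they only distinguish buckets within the -1 set.
    """
    k = len({label for label in labels if label != -1})
    outlier_idx = [i for i, label in enumerate(labels) if label == -1]
    if k == 0 or not outlier_idx:
        return {}
    fringe_ids = kmeans([embeddings[i] for i in outlier_idx], k)
    return dict(zip(outlier_idx, fringe_ids))


# Toy data: two core topics, four outliers sitting near them.
toy_labels = [0, 0, 1, 1, -1, -1, -1, -1]
toy_embeddings = [(0, 0), (0, 1), (10, 10), (10, 11), (1, 0), (1, 1), (9, 10), (9, 11)]
fringe = cluster_outliers(toy_embeddings, toy_labels)
print(fringe)
```

Every -1 message receives a fringe id, which is the full-coverage property noted above.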

Comparison with soft clustering: The two methods identified overlapping but not identical topic structures within the -1 cluster. Many high-level themes were shared — both found economic policy, corporate activity, and sectoral discussions — but the specific groupings differed. K-means surfaced some distinct topics not visible in the soft clustering output, likely corresponding to messages that soft clustering had assigned to the outlier residual. Conversely, some soft clustering fringe topics had no k-means equivalent.

Tradeoffs: K-means results are less interpretable than soft clustering because they are driven purely by centroid proximity rather than by the learned topic structure of the BERTopic run. There is no continuity with the core topic semantic map. However, k-means provides full coverage with a single annotation pass, which is operationally simpler.

Approach 3

KNN Classification via Exemplars

Rather than creating new fringe topics and annotating them separately, KNN classification uses the already-annotated core topics as the reference set. Each annotated topic contributes a set of exemplars — representative messages stored in a ChromaDB vector store. Each -1 message is then queried against this store and assigned to the topic whose exemplars are most similar by cosine similarity.

This approach treats outlier assignment as a retrieval problem rather than a clustering problem. The annotation investment already made on core topics is reused: no second annotation pass is needed for the fringe content.
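The retrieval step can be illustrated without a vector database. In the sketch below an in-memory list stands in for the ChromaDB store, and the choice of `k=3` with a majority vote among the nearest exemplars is an assumption, since the voting rule is not specified here. Topic names and 2-D embeddings are toy values.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def knn_assign(query, exemplars, k=3):
    """Assign `query` to the majority topic among its k most similar exemplars.

    `exemplars` is a list of (topic, embedding) pairs standing in for the
    vector store. Ties go to the topic of the closest neighbour.
    Returns (topic, similarity of the single closest exemplar).
    """
    scored = sorted(
        ((cosine_similarity(query, emb), topic) for topic, emb in exemplars),
        key=lambda pair: pair[0],
        reverse=True,
    )
    nearest = scored[:k]
    votes = Counter(topic for _, topic in nearest)
    return votes.most_common(1)[0][0], nearest[0][0]


# Toy exemplars for two annotated core topics.
toy_exemplars = [
    ("economic policy", (1.0, 0.0)),
    ("economic policy", (0.9, 0.1)),
    ("corporate activity", (0.0, 1.0)),
    ("corporate activity", (0.1, 0.9)),
]
topic, best_similarity = knn_assign((0.8, 0.2), toy_exemplars)
print(topic, round(best_similarity, 3))
```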

Key findings: KNN classification generalised effectively from annotated exemplars to unseen -1 messages, achieving an F1 score of 0.87 on a 7.7M-post multilingual corpus. The method scales well, since ChromaDB handles large embedding stores efficiently, and the threshold parameter (min_similarity) provides a principled mechanism for abstaining on genuinely ambiguous messages rather than forcing an assignment.

Tradeoffs: Performance is bounded by the quality and coverage of the exemplar set — topics with few or unrepresentative exemplars produce weaker classifications. Calibrating min_similarity requires inspection of the exemplar distance distribution for each topic. Messages below the threshold remain unclassified, which may or may not be acceptable depending on project requirements. See semantic-knn for parameter calibration guidance.
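A rough sketch of both halves of this tradeoff follows: a per-topic summary of how similar each exemplar is to its nearest same-topic neighbour, as one possible view of the exemplar distribution to inspect, and an assignment function that abstains below min_similarity. The summary statistic, the function names, and the value 0.9 are assumptions for illustration, not the calibration procedure documented in semantic-knn.

```python
import math
from statistics import median


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def exemplar_similarity_summary(exemplars_by_topic):
    """Per topic, summarise each exemplar's similarity to its nearest sibling."""
    summary = {}
    for topic, embeddings in exemplars_by_topic.items():
        nearest = []
        for i, emb in enumerate(embeddings):
            others = [
                cosine_similarity(emb, other)
                for j, other in enumerate(embeddings)
                if j != i
            ]
            if others:
                nearest.append(max(others))
        if nearest:  # topics with a single exemplar have nothing to compare
            summary[topic] = {
                "min": min(nearest),
                "median": median(nearest),
                "max": max(nearest),
            }
    return summary


def assign_or_abstain(query, exemplars_by_topic, min_similarity):
    """Return the topic of the most similar exemplar, or None below threshold."""
    best_topic, best_similarity = None, -1.0
    for topic, embeddings in exemplars_by_topic.items():
        for emb in embeddings:
            similarity = cosine_similarity(query, emb)
            if similarity > best_similarity:
                best_topic, best_similarity = topic, similarity
    return best_topic if best_similarity >= min_similarity else None


# Toy exemplar store; 0.9 is a placeholder threshold, not a recommendation.
toy_store = {
    "economic policy": [(1.0, 0.0), (0.9, 0.1)],
    "corporate activity": [(0.0, 1.0), (0.1, 0.9)],
}
print(exemplar_similarity_summary(toy_store))
print(assign_or_abstain((0.8, 0.2), toy_store, min_similarity=0.9))
print(assign_or_abstain((0.7, 0.7), toy_store, min_similarity=0.9))
```

A message midway between two topics, like the last query, falls below the threshold and is left unclassified rather than forced into either topic.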

Conclusion

The three approaches address the -1 cluster at different stages of the workflow and with different cost profiles. Soft clustering and k-means generate new fringe topic groupings that require additional annotation — they increase analytical coverage at the cost of annotation overhead. KNN classification sidesteps that overhead by reusing the existing annotated exemplars, making it the most operationally efficient option for large corpora where a second annotation pass is not feasible. Soft clustering remains preferred when interpretability and traceability back to core topic structure matter most; k-means when full coverage is required and fringe topics can be processed independently. In all cases, the choice must be scoped into the project timeline.

A consistent finding across the approaches: fringe content is rarely true noise. The -1 cluster captures peripheral but coherent discourse, the kind of nuanced, interdisciplinary discussion that clients are often most interested in once the main themes are established.

Related Methods

Topic Allocation Workflow, Granular Annotation Scheme

Tech Stack

Python, BERTopic, HDBSCAN, UMAP, scikit-learn, sentence-transformers, NumPy, Pandas, ChromaDB, KNN