
The Question

Source Text or Translation?

When a dataset is not in English, there are two broad approaches to topic modelling. The first is to represent the source language directly using a multilingual sentence transformer and cluster on those embeddings — the analyst then works with the original text. The second is to translate the dataset to English first, cluster on the translation, and present the analyst with the original non-English text broken down according to the translation-based clustering. The translation is under the hood; the analyst never sees it.
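The hand-off in the second approach, where clusters are computed on the translation but the analyst sees only the original text, can be sketched as a join on row position. This is a minimal illustration rather than the pipeline's actual code: the function name is hypothetical, and it assumes each cluster label was produced from the translation of the source text at the same index.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def group_source_by_cluster(
    source_texts: Sequence[str],
    cluster_labels: Sequence[Hashable],
) -> Dict[Hashable, List[str]]:
    """Group original-language texts by labels obtained from clustering translations.

    Assumption: cluster_labels[i] was computed from the translation of
    source_texts[i], so the two sequences are aligned by position.
    """
    if len(source_texts) != len(cluster_labels):
        raise ValueError("expected exactly one cluster label per source text")

    topics: Dict[Hashable, List[str]] = defaultdict(list)
    for text, label in zip(source_texts, cluster_labels):
        # The translation itself is never stored here; only the source text is kept.
        topics[label].append(text)
    return dict(topics)
```

Keeping the alignment purely positional means the translated strings can be discarded once the labels exist, which matches the idea that the translation stays out of the analyst's view.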

The translation-first approach has an intuitive appeal: monolingual English models are generally stronger than multilingual ones, and clustering on English embeddings may produce more coherent topics. But translation introduces error — the clustering is being driven by a representation of the text, not the text itself. If the translation is noisy, the topics will be noisy.

This investigation ran a controlled comparison across three setups on Arabic UN parallel corpus data: clustering on Arabic content using a multilingual model (the baseline), clustering on gold standard English (the ceiling), and clustering on machine-translated English (the scenario being evaluated). The question is how close the translation approach gets to the gold standard, and where it diverges.

Translation

Model Selection & Evaluation

Multiple translation models were evaluated. Most produced noisy, incomplete output: garbled text, stray special characters, truncated sentences. The model selected was facebook/mbart-large-50-many-to-many-mmt, a multilingual BART model that supports direct translation between any pair of its 50 languages, including Arabic to English. Its encoder-decoder architecture is designed to generate fluent output rather than match tokens exactly, which suits the goal here: capturing meaning for clustering, not producing a verbatim translation.
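For orientation, Arabic-to-English translation with this checkpoint through HuggingFace Transformers might look like the sketch below. The batch size, the maximum length, and the helper names are assumptions made for illustration; the source does not state the generation settings that were actually used. The language codes `ar_AR` and `en_XX` are the ones mBART-50 uses for Arabic and English.

```python
from typing import Iterator, List, Sequence

MODEL_NAME = "facebook/mbart-large-50-many-to-many-mmt"


def batched(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def translate_ar_to_en(
    texts: Sequence[str],
    batch_size: int = 8,     # assumed value, not from the source
    max_length: int = 256,   # assumed value, not from the source
) -> List[str]:
    """Translate Arabic texts to English with mBART-50 (downloads the model on first use)."""
    # Imported lazily so the batching helper works without the heavy dependency.
    from transformers import MBart50TokenizerFast, MBartForConditionalGeneration

    tokenizer = MBart50TokenizerFast.from_pretrained(MODEL_NAME)
    model = MBartForConditionalGeneration.from_pretrained(MODEL_NAME)
    tokenizer.src_lang = "ar_AR"  # source language: Arabic
    english_id = tokenizer.convert_tokens_to_ids("en_XX")  # target language token

    translations: List[str] = []
    for batch in batched(texts, batch_size):
        encoded = tokenizer(
            batch,
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=max_length,
        )
        # Forcing the first generated token to the English code selects the target language.
        generated = model.generate(
            **encoded,
            forced_bos_token_id=english_id,
            max_length=max_length,
        )
        translations.extend(tokenizer.batch_decode(generated, skip_special_tokens=True))
    return translations
```

Truncating at a fixed length is a simplification: long sentences are one of the places where translation quality tends to degrade, so a real run would probably want to log which inputs were cut.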

The evaluation benchmark was the OPUS parallel corpus (opus-2019-12-18). Two metrics were used: METEOR, which assesses partial and exact lexical matches allowing for paraphrase and stemming; and BERTScore, which measures semantic similarity using contextualised embeddings.

Results: METEOR average 0.68; BERTScore F1 0.964. The gap between these two scores is itself informative. Although METEOR gives credit for stem and synonym matches, it still penalises freer paraphrase and structural changes even when the meaning is preserved, whereas BERTScore treats semantically equivalent translations as correct. Since the downstream task is clustering on meaning rather than lexical matching, BERTScore is the more relevant signal here. However, cases where BERTScore is high but METEOR is very low warrant manual inspection: the model occasionally produces fluent but semantically drifted output that embedding-based similarity does not catch.
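The manual-inspection rule described above can be expressed as a simple filter over per-sentence scores. The sketch below assumes the METEOR and BERTScore F1 values have already been computed for each sentence pair; the function name and both thresholds are hypothetical and would need tuning against real data.

```python
from typing import List, Sequence


def flag_for_inspection(
    meteor_scores: Sequence[float],
    bertscore_f1: Sequence[float],
    meteor_max: float = 0.3,      # hypothetical cut-off for "very low" METEOR
    bertscore_min: float = 0.95,  # hypothetical cut-off for "high" BERTScore
) -> List[int]:
    """Return indices of sentence pairs with high BERTScore but low METEOR."""
    if len(meteor_scores) != len(bertscore_f1):
        raise ValueError("score lists must be the same length")
    return [
        i
        for i, (meteor, bert) in enumerate(zip(meteor_scores, bertscore_f1))
        if meteor <= meteor_max and bert >= bertscore_min
    ]
```

The returned indices point back into the evaluation set, so the flagged source, reference, and translated sentences can be pulled up side by side for review.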

Findings

Topic Homogeneity Comparison

Topic models were run on all three setups with matched parameters. Topic count varied slightly across setups: 64 topics for Arabic, 69 for English gold standard, 67 for translation. For evaluation, the 10 largest and 5 smallest topics in each setup were annotated. A 10% sample of messages was drawn from each of these topics, each sampled message was assigned a short description, and the entropy of those descriptions was computed per topic to assess homogeneity.
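The selection and scoring steps in this evaluation can be sketched as follows. This is an illustration under assumptions: the source does not say whether the entropy was normalised or what value separated homogeneous from heterogeneous topics, so the code returns plain Shannon entropy in bits and leaves the threshold to the reader. Both function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def select_topics_for_annotation(
    topic_sizes: Dict[int, int],
    n_largest: int = 10,
    n_smallest: int = 5,
) -> List[int]:
    """Return the ids of the largest and smallest topics, ranked by size."""
    ranked = sorted(topic_sizes, key=lambda topic_id: topic_sizes[topic_id], reverse=True)
    if len(ranked) <= n_largest + n_smallest:
        return ranked  # too few topics to split into two groups
    return ranked[:n_largest] + ranked[-n_smallest:]


def description_entropy(descriptions: Sequence[str]) -> float:
    """Shannon entropy (in bits) of the descriptions assigned within one topic.

    0.0 means every sampled message received the same description;
    higher values mean the topic mixes several distinct descriptions.
    """
    counts = Counter(descriptions)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least one description")
    return sum((c / total) * math.log2(total / c) for c in counts.values())
```

A topic whose sampled messages all share one description scores 0.0, while a topic split evenly between two descriptions scores 1.0 bit.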

Homogeneity results:

— Arabic model: 1 heterogeneous topic (6%)

— English gold standard: 1 heterogeneous topic (6%)

— Translation model: 5 heterogeneous topics (30%)

The translation approach produces substantially more heterogeneous topics than either the source text or gold standard approach. This is expected — translation errors compound into clustering noise — but the gap is larger than anticipated, and most of the degradation appeared in the larger topics at the top of the distribution.

Thematic alignment: Despite the homogeneity gap, the three setups identified broadly similar discussions — Cost, Human Rights, Law, Resolution, and Committee appeared across all three. Where setups diverged was in finer-grained distinctions: the translation model produced some unique topic descriptions that were related but not directly mappable to the Arabic or English topics, and missed some that the other two captured. The coverage was reasonable; the coherence of individual topics was the weaker point.
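One rough way to tabulate this kind of alignment is to compare the sets of theme labels found in each setup. The sketch below is a simplification: it relies on exact label matches, whereas the comparison described above involved descriptions that were related but not directly mappable and therefore needed human judgement. The function name and the setup keys are hypothetical.

```python
from typing import Dict, Set


def compare_themes(themes_by_setup: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Return themes shared by every setup and themes unique to each one."""
    if not themes_by_setup:
        raise ValueError("need at least one setup")

    all_sets = list(themes_by_setup.values())
    result: Dict[str, Set[str]] = {"shared": set.intersection(*all_sets)}

    for name, themes in themes_by_setup.items():
        # Union of the themes found in every other setup.
        others: Set[str] = set()
        for other_name, other_themes in themes_by_setup.items():
            if other_name != name:
                others |= other_themes
        result[f"only_{name}"] = themes - others
    return result
```

Given the five themes named above in all three setups, `shared` would contain exactly those five, and each `only_...` entry would list whatever labels a setup produced that no other setup did.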

Considerations

When to Use Each Approach

Cluster on source text (multilingual model): Preferred when a strong multilingual sentence transformer exists for the target language and domain. Avoids compounding translation error into clustering. The analyst works with the original language throughout — no translation artefacts in the analytical output. Best suited for languages where multilingual models have strong coverage.

Translate first, cluster on translation: Appropriate when no adequate multilingual model exists for the language or domain, or when the English monolingual embedding space is significantly stronger for the analytical task at hand. The translation is invisible to the analyst — they receive the original text clustered by translated content. Requires a validated translation model and an understanding of where translation quality degrades (named entities, idiomatic language, long sentences).

Key practical consideration: Translation introduces a fixed quality ceiling. Topic coherence cannot exceed the quality of the translation — if the model paraphrases, drops named entities, or drifts semantically, those errors propagate directly into cluster definitions. The 30% heterogeneity rate observed here used a high-quality model on well-formed UN documents; on noisier social media data the gap would likely be larger.

Evaluation implication: When the translation approach is used, the thematic allocation should be evaluated on the original source text, not the translation — analysts characterise topics based on what they actually read, and that content is in the source language.

Applied in

Investigated on Arabic–English data using the UN Parallel Corpus. The translation-first pipeline was subsequently applied in multilingual projects covering Arabic, Turkish, and Farsi, informing model selection and the decision of whether to cluster on source text or translation.

Tech Stack

Python, mBART, HuggingFace Transformers, sentence-transformers, UMAP, HDBSCAN, BERTScore, METEOR, XLM-RoBERTa, Pandas