Machine Translation Evaluation


Overview

Context & Challenge

Multilingual NLP projects require translation for two distinct purposes. The first is facilitating annotator analysis — translating source text into a language the analyst reads, so they can characterise topics while the model still runs on the original. The second is unifying data across languages into a single target language (typically English) for backend processing such as topic modelling or NER, where the source language lacks strong transformer support.

The core evaluation challenge is that standard n-gram metrics like BLEU systematically underestimate quality for morphologically rich languages — Arabic, Turkish, Urdu, Farsi — where the same meaning can be expressed with very different surface forms. A robust evaluation framework needs metrics that capture semantic similarity, not just lexical overlap.

A further complication is gold standard quality. Evaluation corpora vary significantly in how literally they were translated — scores computed against a loosely translated reference will diverge from scores against a literal one, even if the model output is identical. Manual spot-checking of low-scoring examples is always part of the process.

Approach

Three-Stage Workflow

Stage 1 — Exploratory: Identify a parallel benchmark corpus that represents the text type in the project. Confirm it was not used in training the candidate models. Investigate available translation models on HuggingFace for the target language pair — noting model type (one-to-one, many-to-many), language code format, and hosting requirements.

Stage 2 — Translation: Run candidate models on the benchmark dataset. Record translations alongside computational performance — time per instance at different batch sizes in both CPU and GPU settings. This determines whether a model is feasible for production-scale data.
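
The timing step in Stage 2 can be sketched as a small harness. This is a minimal illustration, not the project's actual code: `time_per_instance` and `dummy_translate` are hypothetical names, and the dummy function stands in for a real model call so the example runs without downloading anything.

```python
import time
from typing import Callable, Dict, List, Sequence


def time_per_instance(
    translate_batch: Callable[[List[str]], List[str]],
    texts: Sequence[str],
    batch_sizes: Sequence[int] = (1, 8, 32),
) -> Dict[int, float]:
    """Return mean wall-clock seconds per instance for each batch size."""
    if not texts:
        raise ValueError("texts must not be empty")
    results: Dict[int, float] = {}
    for size in batch_sizes:
        start = time.perf_counter()
        # Feed the data through the model one batch at a time.
        for i in range(0, len(texts), size):
            translate_batch(list(texts[i:i + size]))
        elapsed = time.perf_counter() - start
        results[size] = elapsed / len(texts)
    return results


def dummy_translate(batch: List[str]) -> List[str]:
    # Hypothetical stand-in for a real model call; it only upper-cases text.
    return [text.upper() for text in batch]


timings = time_per_instance(dummy_translate, ["merhaba dünya"] * 64)
for size, seconds in timings.items():
    print(f"batch_size={size}: {seconds:.6f} s per instance")
```

In practice the callable would wrap the candidate model, and the same harness would be run once on CPU and once on GPU so the two settings are compared on identical inputs.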

Stage 3 — Evaluation: Score translations using METEOR and BERTScore. Sample across score ranges for manual error analysis. Flag systematic patterns — named entity preservation, idiomatic language, code switching, tokenisation and truncation issues, domain mismatch.

Metrics

Evaluation Metrics

BERTScore (primary): Uses contextualised embeddings to measure semantic similarity between the translation and the gold standard. Precision matches each candidate token to its most similar reference token by cosine similarity (greedy matching); recall does the same in the other direction, matching each reference token to its most similar candidate token. F1 is the harmonic mean of the two. Robust to paraphrase and morphological variation, and substantially more reliable than BLEU for Arabic, Turkish, Urdu, and other morphologically rich languages.
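
The greedy matching behind BERTScore can be illustrated on toy vectors. This is a simplified sketch under stated assumptions: the two-dimensional "embeddings" are invented for illustration, `greedy_f1` is a hypothetical helper, and the optional refinements offered by the bert_score library (such as idf weighting and baseline rescaling) are left out.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def greedy_f1(cand: List[Vector], ref: List[Vector]) -> Tuple[float, float, float]:
    """Greedy-matching precision, recall and F1 over token vectors."""
    # Precision: each candidate token takes its best match in the reference.
    precision = sum(max(cosine(c, r) for r in ref) for c in cand) / len(cand)
    # Recall: each reference token takes its best match in the candidate.
    recall = sum(max(cosine(r, c) for c in cand) for r in ref) / len(ref)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy vectors standing in for contextual token embeddings (invented values).
candidate = [(1.0, 0.0), (0.0, 1.0)]
reference = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

p, r, f1 = greedy_f1(candidate, reference)
print(f"precision={p:.4f} recall={r:.4f} f1={f1:.4f}")
```

Here every candidate token has an exact counterpart in the reference, so precision is 1.0, while the third reference token is only partly covered, which pulls recall and F1 below 1.0.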

mBERTScore: Multilingual variant applied when comparing source text directly against translation, without a monolingual gold standard. Used to assess semantic preservation across languages.

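METEOR (primary): Aligns words between the translation and the reference using exact, stem, and synonym matches, combines unigram precision and recall into a recall-weighted mean, and applies a penalty when the matched words are scattered across many out-of-order chunks. Handles paraphrases and partial matches, and is better calibrated to semantic accuracy than BLEU for most translation tasks. Primary metric when the goal is capturing the general content and context of source text rather than exact matches.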
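
How METEOR turns an alignment into a score can be sketched from counts alone. `meteor_from_counts` is a hypothetical helper, not an NLTK function; the parameter names follow those of nltk.translate.meteor_score, and the default values (0.9, 3.0, 0.5) are assumed to match NLTK's and should be checked against the installed version. The alignment itself (finding matches and chunks) is taken as given.

```python
def meteor_from_counts(
    matches: int,
    hyp_len: int,
    ref_len: int,
    chunks: int,
    alpha: float = 0.9,   # weight on recall in the combined mean (assumed default)
    beta: float = 3.0,    # exponent of the fragmentation penalty (assumed default)
    gamma: float = 0.5,   # maximum size of the penalty (assumed default)
) -> float:
    """Score an already-computed unigram alignment, METEOR-style."""
    if matches == 0:
        return 0.0
    precision = matches / hyp_len
    recall = matches / ref_len
    # Recall-weighted harmonic mean of precision and recall.
    f_mean = precision * recall / (alpha * precision + (1 - alpha) * recall)
    # More chunks per matched word means more reordering, so a larger penalty.
    penalty = gamma * (chunks / matches) ** beta
    return (1 - penalty) * f_mean


in_order = meteor_from_counts(matches=6, hyp_len=6, ref_len=6, chunks=1)
reordered = meteor_from_counts(matches=6, hyp_len=6, ref_len=6, chunks=3)
print(f"one chunk: {in_order:.4f}, three chunks: {reordered:.4f}")
```

Both calls describe a translation whose six words all match the reference; the only difference is that the second has the matches split into three separate chunks, so the word-order penalty lowers its score.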

BLEU / chrF (secondary): Retained as secondary signals and for comparison with the literature. BLEU provides a word n-gram precision baseline; chrF works on character n-grams, so it gives partial credit when a word differs from the reference only in its inflection. Divergences between BERTScore and BLEU often point to fluency issues that semantic metrics alone do not surface.
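
The difference between word-level and character-level overlap can be seen on a pair of strings that differ by one suffix. This is a deliberately simplified sketch, not the sacrebleu implementation: it uses a single n-gram order and a balanced F1, whereas chrF averages over several character n-gram orders and weights recall more heavily. `overlap_f1` is a hypothetical helper and the example strings are chosen purely for illustration.

```python
from collections import Counter
from typing import Sequence


def ngrams(items: Sequence[str], n: int) -> Counter:
    """Count the n-grams in a sequence of words or characters."""
    return Counter(tuple(items[i:i + n]) for i in range(len(items) - n + 1))


def overlap_f1(hyp_items: Sequence[str], ref_items: Sequence[str], n: int) -> float:
    """Balanced F1 over clipped n-gram matches."""
    hyp, ref = ngrams(hyp_items, n), ngrams(ref_items, n)
    matched = sum((hyp & ref).values())  # multiset intersection clips counts
    if matched == 0:
        return 0.0
    precision = matched / sum(hyp.values())
    recall = matched / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "evlerimizden geldik"
hypothesis = "evlerimizde geldik"  # differs from the reference by one suffix

word_f1 = overlap_f1(hypothesis.split(), reference.split(), n=1)
char_f1 = overlap_f1(list(hypothesis), list(reference), n=3)
print(f"word unigram F1: {word_f1:.3f}, character trigram F1: {char_f1:.3f}")
```

At word level the inflected form counts as a complete miss, so only one of two words matches; at character level most trigrams of the long word still align, which is why character-based metrics degrade more gracefully on inflected text.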

Tooling

Models & Implementation

Translation models: Model selection is language-specific. Where a dedicated one-to-one model exists for the target language pair (for example in the Helsinki-NLP OPUS family), it is evaluated first, because these models are smaller and faster and therefore more viable for CPU-only deployments. Many-to-many models (e.g. mbart-large-50) provide broader language coverage but are slower and require a GPU for production-scale runs.

Computational benchmarking: Each model is timed at multiple batch sizes in both CPU and GPU settings. The difference is substantial — a dataset of 100,000 records that takes 5–23 days on CPU takes approximately 5 hours on an A100 GPU. Computational feasibility is assessed alongside translation quality before a model is recommended for a project.
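
The feasibility check is simple arithmetic on the measured time per instance. The sketch below uses two hypothetical helpers to convert between per-record time and total runtime; the only inputs are the figures quoted above (100,000 records, 5 to 23 days on CPU, about 5 hours on GPU).

```python
def implied_seconds_per_instance(total_hours: float, n_records: int) -> float:
    """Per-record time implied by a total runtime."""
    return total_hours * 3600 / n_records


def projected_hours(seconds_per_instance: float, n_records: int) -> float:
    """Total runtime projected from a measured per-record time."""
    return seconds_per_instance * n_records / 3600


N = 100_000
cpu_fast = implied_seconds_per_instance(5 * 24, N)    # 5 days on CPU
cpu_slow = implied_seconds_per_instance(23 * 24, N)   # 23 days on CPU
gpu = implied_seconds_per_instance(5, N)              # 5 hours on GPU

print(f"CPU: {cpu_fast:.2f} to {cpu_slow:.2f} s per record")
print(f"GPU: {gpu:.2f} s per record")
print(f"Speed-up: {cpu_fast / gpu:.0f}x to {cpu_slow / gpu:.0f}x")
```

Run in the other direction, `projected_hours` takes a time per instance measured on a small benchmark sample and scales it to the full dataset, which is how a short timing run can rule a model in or out before any production data is translated.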

Evaluation pipeline: METEOR is computed via nltk.translate.meteor_score and BERTScore via the bert_score library. Error analysis samples are drawn from each quartile of the score distribution for manual review.
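
Drawing review samples across score quartiles can be sketched with the standard library alone. `sample_by_quartile` is a hypothetical helper, and the scores below are synthetic placeholders rather than real evaluation results; the sample size per quartile and the fixed seed are arbitrary choices for illustration.

```python
import random
import statistics
from typing import Dict, List, Sequence


def sample_by_quartile(
    scores: Sequence[float], per_bucket: int = 2, seed: int = 0
) -> Dict[str, List[int]]:
    """Pick example indices from each quartile of the score distribution."""
    q1, q2, q3 = statistics.quantiles(scores, n=4)
    buckets: Dict[str, List[int]] = {"Q1": [], "Q2": [], "Q3": [], "Q4": []}
    for idx, score in enumerate(scores):
        if score <= q1:
            buckets["Q1"].append(idx)
        elif score <= q2:
            buckets["Q2"].append(idx)
        elif score <= q3:
            buckets["Q3"].append(idx)
        else:
            buckets["Q4"].append(idx)
    rng = random.Random(seed)  # fixed seed keeps the review sample reproducible
    return {
        name: sorted(rng.sample(members, min(per_bucket, len(members))))
        for name, members in buckets.items()
    }


# Synthetic placeholder scores, evenly spaced between 0.00 and 0.95.
scores = [i / 20 for i in range(20)]
review = sample_by_quartile(scores, per_bucket=2)
for name, indices in review.items():
    print(name, indices)
```

Sampling from every quartile, rather than only the lowest scores, also surfaces cases where a high score hides a poor translation or a low score reflects a loose reference.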

Applied in

Used in multilingual projects where cross-lingual analysis required a principled approach to model selection and translation quality assessment.

Tech Stack

Python, HuggingFace Transformers, bert_score, NLTK METEOR, mBERTScore, sacrebleu, PyTorch, Pandas