NER Evaluation — Ahmed Younes


Overview

Context & Challenge

The projects behind this workflow spanned many languages across a wide resource spectrum, from high-resource languages with established evaluation datasets to severely low-resource African and Central Asian languages where finding any benchmarking data was itself part of the work. The challenge was twofold: first, discovering which fine-tuned transformer models existed for each language on HuggingFace; second, finding which datasets and benchmarks existed to evaluate them against. For the low-resource languages, neither was a given.

This needed to move fast: languages were scoped and evaluated in weekly iterations, not months-long research cycles. The workflow was built to make that process rapid and repeatable: scope the landscape for a new language, benchmark the candidate models against whatever evaluation data existed, then test on real project data. In most cases, the top benchmark model was also the strongest on real project data, but not always. Several lower-ranked models proved better aligned with the actual data distribution, which benchmark scores alone would not have predicted.

Approach

Three-Stage Evaluation

Stage 1 — Model identification: Candidate models sourced from HuggingFace model hub and leaderboards for each target language. For low-resource languages, cross-lingual transfer models (XLM-RoBERTa, multilingual BERT) are evaluated alongside any language-specific models available. Licensing and annotation scheme compatibility checked at this stage.
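The Stage 1 screen can be sketched as a simple filter over candidate metadata. This is a minimal illustration, assuming the metadata has already been collected into plain dictionaries; the field names (model_id, license, labels), the license allow-list, and the example model IDs are all hypothetical and not taken from the repo.

```python
# Hypothetical allow-list; the licenses actually accepted depend on the project.
ALLOWED_LICENSES = {"apache-2.0", "mit", "cc-by-4.0"}


def entity_types(labels):
    """Strip BIO prefixes ("B-PER" -> "PER") and drop the outside tag "O"."""
    return {label.split("-", 1)[-1] for label in labels if label != "O"}


def screen_candidates(candidates, required_types=frozenset({"PER", "LOC", "ORG"})):
    """Keep models with an allowed license whose labels cover the required types."""
    kept = []
    for model in candidates:
        if model["license"] not in ALLOWED_LICENSES:
            continue  # licensing check
        if not required_types <= entity_types(model["labels"]):
            continue  # annotation scheme check
        kept.append(model["model_id"])
    return kept


# Illustrative metadata only; these are not real model IDs.
candidates = [
    {"model_id": "example/xlmr-ner", "license": "mit",
     "labels": ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]},
    {"model_id": "example/restricted-ner", "license": "other",
     "labels": ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]},
    {"model_id": "example/person-only-ner", "license": "apache-2.0",
     "labels": ["O", "B-PER", "I-PER"]},
]
shortlist = screen_candidates(candidates)
print(shortlist)
```

Only the first candidate passes both checks: the second is rejected on license, the third because its label set does not cover LOC and ORG.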

Stage 2 — Benchmark evaluation: Each candidate evaluated against standard datasets (CoNLL-2003, WikiANN, MasakhaNER, language-specific corpora). Evaluation uses seqeval for entity-level F1 (PER, LOC, ORG, MISC) and scikit-learn for token-level metrics. Results tabulated for ranked comparison per entity type.
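The difference between the two metric levels is easiest to see on a single sentence. The sketch below is a dependency-free illustration, not the seqeval or scikit-learn implementation used in the workflow: it assumes BIO-tagged sequences, scores one sentence rather than a whole corpus, and uses plain token accuracy as a stand-in for the token-level metrics. The function names are illustrative.

```python
def extract_spans(tags):
    """Return the set of (type, start, end) entity spans in a BIO tag sequence."""
    spans, start, current = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing span
        inside = tag.startswith("I-") and tag[2:] == current
        if current is not None and not inside:
            spans.add((current, start, i))
            start, current = None, None
        # A stray I- tag with no open span is treated leniently as a span start.
        if tag.startswith("B-") or (tag.startswith("I-") and current is None):
            start, current = i, tag[2:]
    return spans


def entity_f1(gold, pred):
    """Entity-level F1: a span counts only if type and both boundaries match."""
    gold_spans, pred_spans = extract_spans(gold), extract_spans(pred)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def token_accuracy(gold, pred):
    """Token-level accuracy: fraction of positions where the tags agree."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


gold = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC"]
pred = ["B-PER", "I-PER", "O", "B-LOC", "O"]  # LOC span cut one token short

f1 = entity_f1(gold, pred)
acc = token_accuracy(gold, pred)
print(f"entity F1 = {f1:.2f}, token accuracy = {acc:.2f}")
```

Here one wrong token out of five leaves token accuracy at 0.80, while the entity-level F1 drops to 0.50 because the truncated LOC span no longer matches the gold span. This is why the two levels are reported side by side.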

Stage 3 — Project-specific validation: Top-ranked models applied to a sample of actual project data and reviewed using the NER validation tool. Catches domain mismatch that benchmark scores obscure — a model scoring well on news may perform poorly on social media or broadcast transcripts.

Coverage

Language Scope

The framework was designed to work across the full resource spectrum. Stage 1 begins by scoping what models and benchmarks actually exist for a given language — never assuming they do.

High-resource (e.g. English, French, German) — established benchmarks, multiple fine-tuned options, rich evaluation datasets. Model selection is competitive rather than exploratory.

Medium-resource (e.g. Arabic, Turkish, Indonesian) — reasonable model coverage, some language-specific benchmarks, more variation in annotation quality and domain fit.

Low-resource (e.g. Hausa, Twi, Xhosa) — sparse fine-tuned options, evaluation often limited to WikiANN or MasakhaNER, cross-lingual transfer models (XLM-RoBERTa, AfroXLMR) evaluated as the primary candidates.

Arabic is the primary worked example in the repo — benchmark evaluation notebook, extraction pipeline run, and validation tool output included.

Tooling

Benchmarking, Extraction & Validation

Benchmarking evaluation — each candidate model is first checked for license compatibility, then assessed against the datasets available for its language. This step identifies which benchmark datasets exist and whether the model's annotation scheme aligns with them. Labels are mapped to a standard scheme (PER, LOC, ORG, MISC) before evaluation runs. Results are produced at two levels, entity span (seqeval F1 by entity type) and token level (scikit-learn), giving a ranked comparison across models.
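The label-mapping step might look like the following. This is a sketch under stated assumptions: the source labels in LABEL_MAP are hypothetical examples of what a model might emit, tags are assumed to be BIO-prefixed, and sending unmapped types to MISC is one possible design choice rather than the documented behaviour of the repo.

```python
STANDARD_TYPES = {"PER", "LOC", "ORG", "MISC"}

# Hypothetical mapping from one model's label set to the shared scheme.
LABEL_MAP = {
    "PERSON": "PER",
    "GPE": "LOC",
    "LOCATION": "LOC",
    "ORGANIZATION": "ORG",
}


def normalise_tag(tag):
    """Map a BIO tag such as "B-PERSON" onto the standard scheme ("B-PER")."""
    if tag == "O":
        return tag
    prefix, _, entity_type = tag.partition("-")
    if entity_type not in STANDARD_TYPES:
        # Assumption: unknown types fall back to MISC instead of being dropped.
        entity_type = LABEL_MAP.get(entity_type, "MISC")
    return f"{prefix}-{entity_type}"


def normalise_sequence(tags):
    """Apply normalise_tag to every tag in a sentence."""
    return [normalise_tag(tag) for tag in tags]


mapped = normalise_sequence(["B-PERSON", "I-PERSON", "O", "B-GPE", "B-EVENT", "B-ORG"])
print(mapped)
```

Mapping both the model predictions and the benchmark's gold labels through the same function keeps the comparison consistent across models that were trained on different annotation schemes.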

Extraction pipeline — the NamedEntityExtractions class runs the selected model on unlabelled project data at scale using HuggingFace pipelines with batched inference, subword token merging, and structured JSON output.
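Subword merging and structured output can be illustrated without loading a model. The sketch below assumes token-level predictions shaped like the un-aggregated output of a HuggingFace token-classification pipeline (dictionaries with word, entity, start, end), WordPiece-style "##" continuation markers, and that "O" tokens have already been dropped. The merge_tokens function and the sample predictions are hypothetical; whether NamedEntityExtractions merges tokens itself or relies on the pipeline's built-in aggregation is not stated here.

```python
import json


def merge_tokens(text, tokens):
    """Group token-level predictions into entity spans using character offsets."""
    entities = []
    for tok in tokens:
        prefix, _, label = tok["entity"].partition("-")
        continues = bool(
            entities
            and entities[-1]["label"] == label
            and tok["start"] - entities[-1]["end"] <= 1  # adjacent or one space apart
            and (prefix == "I" or tok["word"].startswith("##"))
        )
        if continues:
            entities[-1]["end"] = tok["end"]  # extend the open span
        else:
            entities.append({"label": label, "start": tok["start"], "end": tok["end"]})
    for ent in entities:
        # Recover surface text from the original string, not from subword pieces.
        ent["text"] = text[ent["start"]:ent["end"]]
    return entities


text = "Amina Bello visited Kano."
# Hand-written predictions for illustration; no model was run to produce them.
tokens = [
    {"word": "Amina", "entity": "B-PER", "start": 0, "end": 5},
    {"word": "Bel", "entity": "I-PER", "start": 6, "end": 9},
    {"word": "##lo", "entity": "I-PER", "start": 9, "end": 11},
    {"word": "Kano", "entity": "B-LOC", "start": 20, "end": 24},
]

entities = merge_tokens(text, tokens)
output = json.dumps({"text": text, "entities": entities}, ensure_ascii=False)
print(output)
```

Slicing the original text by character offsets avoids having to strip "##" markers or guess at whitespace, and ensure_ascii=False keeps non-Latin scripts readable in the JSON output.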

Validation app — a Dash-based tool for qualitative review of extraction outputs. Displays colour-coded entity spans, lets reviewers annotate mistakes and missing entities, and captures error type (entity type mismatch, boundary error, truncation) to JSON.
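A reviewer annotation captured by the validation app could be represented along these lines. The record layout is an assumption for illustration: the field names, the ReviewRecord class, and the document ID are hypothetical, and only the three error types named above are modelled. The Dash interface itself is not shown.

```python
import json
from dataclasses import asdict, dataclass
from enum import Enum
from typing import Optional


class ErrorType(Enum):
    TYPE_MISMATCH = "entity_type_mismatch"
    BOUNDARY = "boundary_error"
    TRUNCATION = "truncation"


@dataclass
class ReviewRecord:
    """One reviewer judgement on one extracted span (hypothetical layout)."""
    doc_id: str
    start: int
    end: int
    predicted_label: str
    corrected_label: Optional[str]
    error_type: ErrorType
    note: str = ""

    def to_json(self):
        record = asdict(self)
        record["error_type"] = self.error_type.value  # store the plain string
        return json.dumps(record, ensure_ascii=False)


record = ReviewRecord(
    doc_id="sample-001",
    start=20,
    end=23,
    predicted_label="LOC",
    corrected_label="LOC",
    error_type=ErrorType.BOUNDARY,
    note="Span stops one character short of the full place name.",
)
line = record.to_json()
print(line)
```

Writing one JSON object per reviewed span keeps the output easy to aggregate later, for example by counting error types per model.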

Applied in

Applied in projects involving multilingual social media analysis, mainstream media monitoring, and broadcast transcript processing across a range of languages and domains.

Tech Stack

Python, HuggingFace Transformers, HuggingFace Datasets, seqeval, scikit-learn, Dash, PyTorch, pandas