PhD Doctoral Output · University of Sussex · 2025
DeformAr
A Diagnostic Framework for NER Evaluation
Published · 2025 · Repositories
The Project
Context & Motivation
DeformAr is the primary output of my PhD in Natural Language Processing at the University of Sussex (2025). It is a diagnostic framework for evaluating Named Entity Recognition (NER) systems — designed to go beyond standard aggregate metrics like F1 score and expose the fine-grained, component-level failure modes that aggregate scores conceal.
The framework was developed and validated across Arabic and English — two typologically different languages that present distinct challenges for NER systems. Arabic is morphologically complex, right-to-left, and has significant dialectal variation. Evaluating across both languages surfaces insights that single-language evaluation misses.
The Problem
What DeformAr Solves
Standard NER evaluation typically reports a single aggregate F1 score. This tells you how well a model performs overall, but not why it fails, where it fails, or what kind of errors it makes. A model that misses every person name but identifies every location correctly can receive the same score as a model whose errors are spread evenly across all entity types.
For practitioners who need to choose, deploy or improve NER systems, this is not enough information. DeformAr provides a multi-dimensional diagnostic lens so that model selection and improvement decisions can be made on evidence rather than a single number.
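To make the point concrete, here is a minimal sketch in Python. It is not taken from DeformAr itself: the entity representation, the function names `f1` and `per_type_f1`, and the toy data are all assumptions made up for illustration. It shows two hypothetical models with identical aggregate (micro) F1 and very different per-type profiles.

```python
# Illustrative only: entities are (sentence_id, start, end, type) tuples,
# and all data below is invented for this example.

def f1(gold, pred):
    """Exact-match F1 between two sets of entity tuples."""
    tp = len(gold & pred)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

def per_type_f1(gold, pred):
    """F1 computed separately for each entity type."""
    types = sorted({e[3] for e in gold | pred})
    return {
        t: f1({e for e in gold if e[3] == t}, {e for e in pred if e[3] == t})
        for t in types
    }

# Four sentences, each with one PER and one LOC entity.
gold = {(i, 0, 1, "PER") for i in range(4)} | {(i, 3, 4, "LOC") for i in range(4)}

# Hypothetical model A: finds every location, misses every person.
model_a = {e for e in gold if e[3] == "LOC"}
# Hypothetical model B: finds half of each type.
model_b = {e for e in gold if e[0] < 2}

print(f1(gold, model_a), f1(gold, model_b))  # identical aggregate scores
print(per_type_f1(gold, model_a))            # PER collapses, LOC is perfect
print(per_type_f1(gold, model_b))            # errors spread across both types
```

Both models recover four of the eight gold entities with no false positives, so the aggregate score cannot tell them apart. Only the per-type breakdown does.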
Framework
Components & Design
- Entity-type analysis — separate evaluation per class (PER, LOC, ORG, MISC)
- Boundary detection — distinguishing span boundary errors from entity type errors
- Token-level diagnostics — identifying which tokens consistently cause failures
- Domain sensitivity — how performance shifts across news, social media and formal text
- Cross-lingual transfer — how models trained on one language perform on another
- Comparison framework — structured multi-model comparison on the same diagnostic dimensions
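The boundary-versus-type distinction in the list above can be sketched as follows. This is a simplified illustration under stated assumptions, not DeformAr's actual implementation: the error categories, the function names `overlaps` and `classify_errors`, and the example spans are hypothetical. Spans are `(start, end, type)` with an exclusive end index.

```python
from collections import Counter

def overlaps(a, b):
    """True if two (start, end, type) spans share at least one token."""
    return a[0] < b[1] and b[0] < a[1]

def classify_errors(gold, pred):
    """Count predictions by outcome; the categories are an assumed taxonomy."""
    counts = Counter()
    gold_types = {(s, e): t for s, e, t in gold}
    touched = set()  # gold spans matched or overlapped by some prediction
    for p in pred:
        s, e, t = p
        if (s, e) in gold_types:
            # Same boundaries: either fully correct or a pure type error.
            touched.add((s, e))
            counts["correct" if gold_types[(s, e)] == t else "type"] += 1
            continue
        hits = [g for g in gold if overlaps(p, g)]
        if not hits:
            counts["spurious"] += 1
            continue
        touched.update((g[0], g[1]) for g in hits)
        # Wrong boundaries: check whether the type at least matches.
        same_type = any(g[2] == t for g in hits)
        counts["boundary" if same_type else "boundary+type"] += 1
    counts["missed"] = sum(1 for s, e, _ in gold if (s, e) not in touched)
    return counts

# Made-up example for one sentence.
gold = [(0, 2, "PER"), (5, 6, "LOC"), (8, 10, "ORG"), (12, 13, "LOC")]
pred = [(0, 1, "PER"), (5, 6, "ORG"), (8, 10, "ORG"), (15, 16, "MISC")]

print(classify_errors(gold, pred))
```

Running the same classifier over several models' predictions on the same gold data gives directly comparable error profiles, which is the idea behind the comparison dimension.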
Outcomes
Findings & Impact
DeformAr shows that the majority of Arabic NER errors are boundary errors rather than type errors. Aggregate F1 scores obscure this finding, which has direct implications for model fine-tuning strategy.
The diagnostic dashboard enables practitioners to identify model weaknesses on specific entity types and domains, making model selection evidence-based rather than benchmark-driven.
The NER evaluation framework developed for DeformAr has been applied in client work across FAST, BBC Monitoring and other multilingual NLP projects.
Tech Stack
Education
- PhD in Natural Language Processing — University of Sussex
- MSc in Data Science (Distinction)
- BSc in Computer Science (First Class)
Further Links
- GitHub Profile ↗
- Thesis — available on request