
The Project

Context & Motivation

DeformAr is the primary output of my PhD in Natural Language Processing at the University of Sussex (2025). It is a diagnostic framework for evaluating Named Entity Recognition (NER) systems — designed to go beyond standard aggregate metrics like F1 score and expose the fine-grained, component-level failure modes that aggregate scores conceal.

The framework was developed and validated on Arabic and English, two typologically different languages that pose distinct challenges for NER systems. Arabic is morphologically complex, written right-to-left, and has significant dialectal variation. Evaluating across both languages surfaces insights that single-language evaluation misses.

The Problem

What DeformAr Solves

Standard NER evaluation reports a single F1 score. This tells you how well a model performs overall, but not why it fails, where it fails, or what kind of errors it makes. A model that misses every person name but identifies every location correctly can earn the same F1 as a model whose errors are scattered across all entity types.

For practitioners who need to choose, deploy or improve NER systems, this is not enough information. DeformAr provides a multi-dimensional diagnostic lens so that model selection and improvement decisions can be made on evidence rather than a single number.
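
To make the point about a single F1 score concrete, here is a minimal sketch in plain Python. It is not DeformAr's code: the span format `(type, sentence_id, start, end)`, the helper names `micro_f1` and `per_type_f1`, and the two toy models are all assumptions made up for illustration.

```python
def f1(tp, fp, fn):
    """Standard F1 from true positives, false positives and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def micro_f1(gold, pred):
    """Exact-match span F1 over all entity types pooled together."""
    tp = len(gold & pred)
    return f1(tp, len(pred - gold), len(gold - pred))


def per_type_f1(gold, pred):
    """The same metric, computed separately for each entity type."""
    types = sorted({span[0] for span in gold | pred})
    return {
        t: micro_f1({s for s in gold if s[0] == t}, {s for s in pred if s[0] == t})
        for t in types
    }


# Hypothetical gold data: one PER and one LOC span in each of four sentences.
gold = {("PER", i, 0, 1) for i in range(4)} | {("LOC", i, 3, 4) for i in range(4)}

# Toy model A finds every location and no person names.
model_a = {("LOC", i, 3, 4) for i in range(4)}

# Toy model B finds half of each type.
model_b = {("PER", 0, 0, 1), ("PER", 1, 0, 1), ("LOC", 0, 3, 4), ("LOC", 1, 3, 4)}

for name, pred in [("A", model_a), ("B", model_b)]:
    print(name, round(micro_f1(gold, pred), 3), per_type_f1(gold, pred))
```

Both toy models reach the same pooled F1, yet model A scores zero on PER while model B is equally mediocre on both types. The aggregate number alone cannot tell them apart; the per-type breakdown can.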

Framework

Components & Design

  • Entity-type analysis — separate evaluation per class (PER, LOC, ORG, MISC)
  • Boundary detection — distinguishing span boundary errors from entity type errors
  • Token-level diagnostics — identifying which tokens consistently cause failures
  • Domain sensitivity — how performance shifts across news, social media and formal text
  • Cross-lingual transfer — how models trained on one language perform on another
  • Comparison framework — structured multi-model comparison on the same diagnostic dimensions
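
The boundary-versus-type distinction in the list above can be sketched as follows. This is one possible way to bucket errors, assuming spans are given as token offsets with an exclusive end. The `Span` type, the category names and the precedence of the rules are illustrative assumptions, not DeformAr's actual taxonomy.

```python
from collections import Counter
from typing import NamedTuple


class Span(NamedTuple):
    label: str
    start: int  # inclusive token index
    end: int    # exclusive token index


def overlaps(a, b):
    """True if the two spans share at least one token."""
    return a.start < b.end and b.start < a.end


def classify_prediction(pred, gold_spans):
    """Assign one predicted span to an (assumed) error category."""
    if pred in gold_spans:
        return "correct"
    overlapping = [g for g in gold_spans if overlaps(pred, g)]
    if not overlapping:
        return "spurious"
    if any((g.start, g.end) == (pred.start, pred.end) for g in overlapping):
        return "type_error"  # right boundaries, wrong label
    if any(g.label == pred.label for g in overlapping):
        return "boundary_error"  # right label, wrong boundaries
    return "boundary_and_type_error"


def error_profile(gold_spans, pred_spans):
    """Count each category, plus gold spans no prediction touches."""
    gold_set = set(gold_spans)
    counts = Counter(classify_prediction(p, gold_set) for p in pred_spans)
    counts["missed"] = sum(
        1 for g in gold_spans if not any(overlaps(g, p) for p in pred_spans)
    )
    return counts


# Hypothetical gold and predicted spans for a single sentence.
gold = [Span("PER", 0, 2), Span("LOC", 4, 5), Span("ORG", 7, 9), Span("LOC", 11, 12)]
pred = [Span("PER", 0, 1), Span("ORG", 4, 5), Span("ORG", 7, 9), Span("MISC", 14, 15)]

profile = error_profile(gold, pred)
print(dict(profile))
```

Under exact-match scoring, the truncated PER span and the mislabelled LOC span both count simply as wrong. Separating them matters because they point to different remedies: the first is a segmentation problem and the second is a classification problem.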

Outcomes

Findings & Impact

DeformAr shows that the majority of Arabic NER errors are boundary errors rather than type errors. Aggregate F1 scores obscure this finding, which has direct implications for model fine-tuning strategy.

The diagnostic dashboard enables practitioners to identify model weaknesses on specific entity types and domains, making model selection evidence-based rather than benchmark-driven.

The NER evaluation framework developed for DeformAr has been used in client work across FAST, BBC Monitoring and other multilingual NLP projects.

Tech Stack

Python, PyTorch, HuggingFace Transformers, Arabic NLP, seqeval, Streamlit, Pandas, Plotly

Education

  • PhD in Natural Language Processing — University of Sussex 2025
  • MSc in Data Science (Distinction) 2019
  • BSc in Computer Science (First Class) 2016