Analytical Toolkit
Methods
Each method documented here is a workflow or tool I designed and built myself, developed iteratively across research and consulting engagements rather than simply applied off the shelf.
Delivery Methodology
Stakeholder Communication Framework
A four-stage protocol for presenting analytical findings to mixed technical and non-technical audiences — anchor to the decision, show the funnel, translate outputs to business meaning, and engineer next steps that build momentum.
Documentation Project
Topic Modelling Recipes
A practitioner's guide accumulated across multiple projects — methodological checklists, stage-specific investigations, and documented decisions from running topic modelling in production across languages and client contexts.
Text Analysis
Topic Allocation Workflow
Five-stage pipeline for large-scale narrative discovery — sentence-transformer embeddings, UMAP, and HDBSCAN at its core. Adapts to qualitative exploration and quantitative classification, with multilayered and guided variants.
Research Investigation
Outlier Mitigation in HDBSCAN
Empirical investigation into the HDBSCAN -1 (noise) cluster, which consistently captures 40–60% of the data. Compares soft clustering (membership vectors) and k-means as strategies for recovering analytically valuable fringe content.
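As a rough illustration of the centroid-based idea, the sketch below reassigns points labelled -1 to the nearest centroid of the existing clusters. This is only the assignment step that k-means uses, not a full k-means run and not necessarily the procedure followed in the investigation; the function names `centroid` and `reassign_outliers` are hypothetical, and plain coordinate lists stand in for embeddings.

```python
import math


def centroid(points):
    """Mean position of a list of equal-length coordinate sequences."""
    dims = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dims)]


def reassign_outliers(points, labels):
    """Give every point labelled -1 the label of its nearest cluster centroid.

    Assumption: Euclidean distance is a reasonable proxy for similarity in
    the (reduced) embedding space. Points with a non-negative label are kept.
    """
    clusters = {}
    for point, label in zip(points, labels):
        if label != -1:
            clusters.setdefault(label, []).append(point)
    if not clusters:
        # Nothing to attach outliers to, so return the labels unchanged.
        return list(labels)

    centroids = {label: centroid(members) for label, members in clusters.items()}
    new_labels = []
    for point, label in zip(points, labels):
        if label == -1:
            label = min(centroids, key=lambda c: math.dist(point, centroids[c]))
        new_labels.append(label)
    return new_labels


points = [(0, 0), (0, 1), (10, 10), (10, 11), (1, 1), (9, 9)]
labels = [0, 0, 1, 1, -1, -1]
print(reassign_outliers(points, labels))
```

Forcing every outlier into a cluster trades purity for coverage, so in practice a distance cut-off would probably be needed to keep genuinely unrelated content out.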
Text Analysis
Granular Annotation Scheme
Entropy-based stopping criteria for topic annotation. Replaces fixed-sample characterisation with proportional sampling and per-message description, using Shannon entropy to determine when a topic's description has stabilised sufficiently to stop.
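One way such a stopping rule could look is sketched below, under assumptions of my own: each annotated message is reduced to a single short description label, entropy is computed over the running label distribution, and annotation stops once entropy has moved by less than `tol` over the last `window` messages. The names `stopping_point`, `window`, `tol`, and `min_samples` are hypothetical and are not taken from the scheme itself.

```python
import math
from collections import Counter


def shannon_entropy(labels):
    """Shannon entropy, in bits, of the empirical distribution of labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def stopping_point(labels, window=5, tol=0.05, min_samples=10):
    """Return how many messages were annotated when entropy stabilised.

    Entropy is recomputed after each new message. Annotation is considered
    stable when the spread of the last `window + 1` entropy values is below
    `tol`. Returns None if that never happens within the given labels.
    """
    history = []
    for n in range(1, len(labels) + 1):
        history.append(shannon_entropy(labels[:n]))
        if n >= max(min_samples, window + 1):
            recent = history[-(window + 1):]
            if max(recent) - min(recent) < tol:
                return n
    return None


# A topic whose messages all receive the same description stabilises
# as soon as the minimum sample size is reached.
print(stopping_point(["price complaint"] * 20))
```

A heterogeneous topic keeps introducing new labels, so its entropy keeps moving and the rule asks for more annotation, which is the behaviour a fixed sample size cannot provide.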
Text Analysis
Topic Model Evaluation
Review vs blind evaluation of thematic allocation — when each is appropriate, how anchoring bias emerges, and a hybrid approach that splits the evaluation sample into a review subset and a blind test subset.
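A minimal sketch of the hybrid split, assuming the evaluation sample is a list of document IDs and that a seeded random shuffle is an acceptable way to divide it. The function name `split_evaluation_sample` and the default 50/50 `review_fraction` are illustrative assumptions, not figures from the method.

```python
import random


def split_evaluation_sample(item_ids, review_fraction=0.5, seed=0):
    """Split an evaluation sample into a review subset and a blind subset.

    Review items are shown to the evaluator with the model's allocation
    visible; blind items are labelled without seeing it, so they are not
    exposed to anchoring on the model's answer.
    """
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    shuffled = list(item_ids)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * review_fraction)
    return shuffled[:cut], shuffled[cut:]


review, blind = split_evaluation_sample(range(10))
print(len(review), len(blind))
```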
Research Investigation
Translation for Topic Modelling
Empirical comparison of clustering on source text vs translated text — three setups on Arabic-English data showing homogeneity tradeoffs, when each approach is appropriate, and what translation quality means for cluster coherence.
Text Analysis
Guided Topic Modelling
Fine-tuning sentence transformer embeddings with contrastive learning to steer clustering toward predefined analytical objectives — bridging unsupervised discovery and analytically defined frameworks.
Classification
Contrastive Fine-tuning for Classification
Reshaping the sentence transformer embedding space with contrastive learning to improve k-NN classification accuracy on population-scale annotated data — a standalone evaluation experiment with a structured pipeline and comparison dashboard.
Text Analysis
Multilayered Topic Modelling
Iterative clustering passes applied to heterogeneous topics. Layer 1 produces a broad thematic breakdown; subsequent layers dissect heterogeneous topics into sub-narratives discovered from the data rather than declared by classifiers.
Classification
Classification Approaches
Three strategies for assigning categories to documents at scale — zero-shot NLI, exemplar-based k-NN, and keyword and ML-based classifiers — and a framework for choosing between them based on available labels and accuracy requirements.
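To make the exemplar-based k-NN strategy concrete, here is a small sketch assuming documents have already been embedded as vectors and that cosine similarity with majority voting is used. The toy two-dimensional vectors, the labels, and the function names `cosine` and `knn_classify` are all hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def knn_classify(query, exemplars, k=3):
    """Label a query vector by majority vote among its k most similar exemplars.

    `exemplars` is a list of (vector, label) pairs, i.e. hand-labelled
    example documents that have been embedded in advance.
    """
    ranked = sorted(exemplars, key=lambda ex: cosine(query, ex[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


exemplars = [
    ([1.0, 0.0], "sports"),
    ([0.9, 0.1], "sports"),
    ([0.0, 1.0], "politics"),
    ([0.1, 0.9], "politics"),
]
print(knn_classify([1.0, 0.2], exemplars, k=3))
```

This approach needs only a handful of labelled exemplars per category, which is why it sits between zero-shot NLI (no labels) and trained classifiers (many labels) in the choice described above.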
Information Extraction
Broadcast Transcript Analysis
Pipeline for parsing multilingual broadcast STT transcripts — speaker diarisation, Arabic prefix merging, HuggingFace NER extraction, and fuzzy duplicate detection — with a Streamlit dashboard for exploration across Iraqi Arabic, Indonesian, and English channels.
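The fuzzy duplicate detection step might be approximated as follows. This sketch assumes character-level similarity from Python's standard `difflib` module and a hypothetical threshold of 0.9; the pipeline itself may well use a different matcher, and the pairwise loop is only practical for small batches of segments.

```python
from difflib import SequenceMatcher


def normalise(text):
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def find_fuzzy_duplicates(segments, threshold=0.9):
    """Return index pairs (i, j), i < j, whose similarity ratio meets the threshold."""
    cleaned = [normalise(s) for s in segments]
    pairs = []
    for i in range(len(cleaned)):
        for j in range(i + 1, len(cleaned)):
            ratio = SequenceMatcher(None, cleaned[i], cleaned[j]).ratio()
            if ratio >= threshold:
                pairs.append((i, j))
    return pairs


segments = [
    "The minister arrived in Baghdad today.",
    "the minister  arrived in Baghdad today",
    "Weather will be sunny tomorrow.",
]
print(find_fuzzy_duplicates(segments))
```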
Data Engineering
Social Media Data Collection
Multi-platform collection infrastructure managing continuous ingestion pipelines across Brandwatch, Telegram, YouTube, Twitter/X, and CrowdTangle — with daily quota management, allowlist curation, and automated ingestion into a centralised data warehouse for downstream NLP analysis.
Information Extraction
Multilingual NER Evaluation
Three-stage benchmarking framework for selecting and validating NER models across 26 languages, from high-resource to low-resource. Separates benchmark performance from project-specific domain validation.
Cross-lingual NLP
Machine Translation Evaluation
Comparative framework using BERTScore as the primary quality signal alongside METEOR. Includes structured error analysis covering named entity preservation, tokenisation, domain shift, and idiomatic language across morphologically rich languages.