
The Project

Context & Objectives

An engagement with BBC Monitoring to map China's diplomatic social media activity across Twitter and Facebook. Phase 1 (2019–2020) collected 432,800 messages from 372 diplomatic accounts; Phase 2 (2021) classified 102,883 of those messages across 9 themes, 4 languages and 6 global regions.

The aim was to give BBC Monitoring a systematic, scalable view of how Chinese diplomatic accounts framed narratives across regions and platforms — from COVID-19 and geopolitics to human rights and technology.

My Role

Contributions

Phase 1 (2019–2020): restructured and standardised the account seed list into a unified collection schema across 372 diplomatic accounts (13 handle/link types). Set up and maintained the data collection pipeline — cron-based scheduling for ongoing ingestion and historic backfill, collecting 432,800 messages between October 2019 and December 2020. Led the keyword-based thematic pilot across all four languages to establish an agreed thematic breakdown. Named contributor on the published BBC Monitoring report.

Phase 2 (2021): contributed to training and evaluating 34 binary classifiers across 4 languages and 9 themes. Supported annotation coordination across language specialist teams, running active learning cycles to iteratively improve model quality and refining the pipeline throughout.

Scope

Project Scope

Phase 1 — Data Collection (Oct 2019 – Dec 2020)

  • Accounts mapped: 372 diplomatic accounts across 13 handle/link types (embassies, consulates, ambassadors, press officers)
  • Messages collected: 432,800
  • Languages: English, Arabic, French, Spanish
  • Pipeline: cron-based scheduling for ongoing ingestion and historic backfill
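
As an illustration of the cron-driven ingestion step, here is a minimal sketch of an incremental collector. It is not the project's actual code: the `fetch(account, since_id)` callable, the JSON state file and the integer `id` field are all assumptions made for the example. The idea is that each scheduled run only requests messages newer than the last one stored per account, so repeated runs do not duplicate data.

```python
import json
from pathlib import Path


def load_state(path):
    """Read the per-account high-water marks; an absent file means a first run."""
    p = Path(path)
    return json.loads(p.read_text()) if p.exists() else {}


def save_state(path, state):
    """Persist the high-water marks so the next scheduled run can resume."""
    Path(path).write_text(json.dumps(state, indent=2))


def collect_new(accounts, fetch, state):
    """Ongoing ingestion: keep only messages newer than the last stored id.

    `fetch(account, since_id)` is a hypothetical callable standing in for a
    platform API client; it is assumed to return dicts with an integer "id".
    `state` is updated in place.
    """
    collected = []
    for account in accounts:
        since_id = state.get(account, 0)
        messages = fetch(account, since_id)
        if messages:
            state[account] = max(m["id"] for m in messages)
            collected.extend(messages)
    return collected


# Example crontab entry (assumed script name and schedule), hourly at minute 0:
# 0 * * * * /usr/bin/python3 /opt/collector/collect.py
```

A historic backfill could reuse the same function with an empty state, so that every account is fetched from the beginning of the collection window.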

Phase 2 — Classification (2021)

  • Posts classified: 102,883 across Twitter and Facebook
  • Regions: Asia-Pacific, Africa, Americas, Europe, Middle East, Eurasia
  • Themes: 9 (Geopolitics, Economy, COVID-19, Politics & Society, Culture & People, Military & Security, Technology, Environment, Human Rights)
  • Classifiers trained: 34 binary classifiers across 9 themes and 4 languages
  • Evaluation: manually annotated gold-standard datasets per language and theme

Method

Approach & Pipeline

Phase 1 — Keyword Discovery (2019–2020): account seed list compiled and unified into a consistent collection schema across 372 accounts and 13 diplomatic handle types. Keyword-based thematic pilot run across all four languages to agree on a thematic framework before moving to supervised classification.
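
A keyword-based pilot of this kind can be sketched roughly as follows. The keyword lists here are invented placeholders, not the lists used in the project, and the matching rule (case-insensitive, whole-word) is an assumption. Each message is tagged with every theme whose keywords it contains, and the tags are then counted to give a first thematic breakdown.

```python
import re
from collections import Counter

# Hypothetical keyword lists for illustration only.
THEME_KEYWORDS = {
    "COVID-19": ["covid", "coronavirus", "vaccine"],
    "Technology": ["5g", "huawei", "artificial intelligence"],
    "Human Rights": ["human rights", "xinjiang"],
}


def compile_patterns(theme_keywords):
    """Build one case-insensitive, whole-word regex per theme."""
    patterns = {}
    for theme, words in theme_keywords.items():
        alternation = "|".join(re.escape(w) for w in words)
        patterns[theme] = re.compile(
            rf"(?<!\w)(?:{alternation})(?!\w)", re.IGNORECASE
        )
    return patterns


def tag_message(text, patterns):
    """Return the sorted list of themes whose keywords appear in the text."""
    return sorted(theme for theme, p in patterns.items() if p.search(text))


def theme_counts(messages, patterns):
    """Count how many messages touch each theme (a message can hit several)."""
    counts = Counter()
    for text in messages:
        counts.update(tag_message(text, patterns))
    return counts
```

Because a message can match several themes, the counts can sum to more than the number of messages. That is compatible with the later design of one binary classifier per theme rather than a single multi-class model.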

Phase 2 — ML Classification (2021): 34 binary classifiers covering 9 themes and 4 languages, each trained on manually annotated gold-standard data. Annotation was coordinated across four language specialist teams, with defined guidelines and documented positive and negative examples. Active learning cycles were used to iteratively surface high-value training examples and improve performance. The pipeline was maintained and refined throughout.
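
One active learning cycle might look like the sketch below. The source only states that scikit-learn and active learning were used, so the model (TF-IDF features with logistic regression) and the query strategy (uncertainty sampling) are assumptions chosen for illustration, and the function names are hypothetical. A binary classifier for a single theme is trained on the labelled posts, and the unlabelled posts it is least sure about are sent to annotators next.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def train_binary_classifier(texts, labels):
    """Fit one theme-vs-rest classifier; labels are 1 (on theme) or 0 (off theme)."""
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(texts, labels)
    return model


def select_for_annotation(model, unlabeled_texts, batch_size=2):
    """Uncertainty sampling: indices of posts whose P(on theme) is closest to 0.5."""
    proba = model.predict_proba(unlabeled_texts)[:, 1]
    uncertainty = np.abs(proba - 0.5)
    return np.argsort(uncertainty)[:batch_size].tolist()


# Toy data, invented for the example.
labeled_texts = [
    "vaccine doses delivered to partner hospitals",
    "coronavirus prevention measures announced",
    "pandemic response and medical supplies",
    "trade agreement signed with regional partners",
    "cultural festival celebrates spring",
    "new railway project opens for business",
]
labels = [1, 1, 1, 0, 0, 0]
unlabeled_texts = [
    "medical teams arrive with vaccine supplies",
    "festival of trade and culture",
    "hospital railway link announced",
    "ambassador meets university students",
]

model = train_binary_classifier(labeled_texts, labels)
to_annotate = select_for_annotation(model, unlabeled_texts, batch_size=2)
```

After the selected posts are annotated they are added to the labelled set and the classifier is retrained, which is repeated until evaluation on the gold-standard data stops improving.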

Outcomes

Results & Impact

  • Average F1: 80.5% across 34 classifiers — consistent performance across 4 languages and 9 themes
  • Coverage: 102,883 posts classified across 6 global regions, spanning English, Arabic, Spanish and French
  • Published report: "China's Public Diplomacy on Twitter and Facebook" — BBC Monitoring, May 2022
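
For reference, an average F1 across a set of binary classifiers can be computed as in this sketch. It assumes an unweighted mean of per-classifier F1 scores, which may differ from how the 80.5% figure was actually aggregated, and the data layout (a dict keyed by theme and language) is hypothetical.

```python
def f1_score_binary(y_true, y_pred):
    """F1 for the positive class (label 1) of one binary classifier."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def average_f1(results):
    """Unweighted mean F1 over classifiers.

    `results` maps a (theme, language) key to a (y_true, y_pred) pair.
    Returns the mean and the per-classifier scores.
    """
    scores = {key: f1_score_binary(*pair) for key, pair in results.items()}
    return sum(scores.values()) / len(scores), scores
```

An unweighted mean treats every classifier equally regardless of how many gold-standard examples it was evaluated on, so a weak classifier for a rare theme pulls the average down as much as one for a common theme.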

Tech Stack

Python, scikit-learn, Twitter API, CrowdTangle, Pandas, Plotly