The Project

Context & Objectives

This was a joint research project between ISD (Institute for Strategic Dialogue) and CASM Technology, funded by the European Media and Information Fund. The study investigated how pro-Kremlin, conspiracy, far-right and far-left communities interact on Telegram across France, Germany and Italy: mapping influence networks, identifying key actors, and tracking how narratives spread between January 2022 and February 2024.

Published as "Behind the Curtain: Unveiling Pro-Kremlin Ecosystems in Europe" (ISD, 2025).

My Role

Senior Data Scientist

Co-designed the analytical methodology, including the decision to use a single multilingual topic model rather than per-language models, so that overlap between communities in different languages would be preserved. Responsible for the account-stratified sampling strategy (using Vaex and Parquet to process the data at scale), building the topic model, and designing a granular annotation scheme used across all three language teams.

Coordinated multilingual annotation — running 1:1 meetings with each language team, tracking progress using entropy-based ambiguity metrics, and facilitating consolidated sessions where teams aligned themes across languages.

Applied KNN classification (backed by ChromaDB) to extend thematic labels from the annotated sample to the wider corpus (4.9M posts classified, out of 7.7M collected), which enabled the cross-community overlap analysis that was the primary research objective. Adapted and extended the Streamlit reporting dashboard to suit the analytical questions. Contributed to the methodology section of the published report.

Scope

Project Scope

  • Platform: Telegram
  • Countries: France, Germany, Italy
  • Languages: French, German, Italian (+ English-language channels)
  • Period: January 2022 – February 2024
  • Channels: 1,162 seed channels across 4 community types — pro-Kremlin, conspiracy, far-right, far-left
  • Posts collected: 7,784,669
  • Posts classified: 4,936,025 (after deduplication and sampling)
  • Taxonomy: 22 themes, 113 subthemes

Method

Approach & Pipeline

Given the scale of the data (7.7M posts across three languages), a stratified sample of ~500k messages was drawn first — capped at 300 messages per account to maintain broad coverage. The sample was processed with paraphrase-multilingual-mpnet-base-v2 for embeddings, then UMAP for dimensionality reduction and HDBSCAN for clustering, producing 162 raw clusters.
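The per-account cap described above can be illustrated with a minimal, standard-library sketch. It is not the project's implementation (which used Vaex and Parquet); the function name, the `(account_id, message_id)` input shape and the optional `target_size` step are assumptions made for illustration.

```python
import random
from collections import defaultdict


def sample_capped_per_account(messages, cap=300, target_size=None, seed=0):
    """Draw a sample with at most `cap` messages per account.

    `messages` is an iterable of (account_id, message_id) pairs
    (a hypothetical input shape chosen for this sketch).
    """
    rng = random.Random(seed)

    # Group message ids by the account that posted them.
    by_account = defaultdict(list)
    for account_id, message_id in messages:
        by_account[account_id].append(message_id)

    # Keep every message from small accounts; subsample prolific ones.
    sample = []
    for account_id in sorted(by_account):
        ids = by_account[account_id]
        kept = ids if len(ids) <= cap else rng.sample(ids, cap)
        sample.extend((account_id, m) for m in kept)

    # Optionally shrink the pooled sample to an overall target size.
    if target_size is not None and len(sample) > target_size:
        sample = rng.sample(sample, target_size)
    return sample


# Toy data: one prolific account and one quiet account.
toy = [("a", i) for i in range(1000)] + [("b", i) for i in range(5)]
picked = sample_capped_per_account(toy, cap=300)
```

In the toy data the prolific account contributes 300 messages and the quiet one keeps all 5, so a handful of high-volume accounts cannot dominate the sample that feeds the topic model.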

Language teams reviewed samples from each cluster using a structured annotation scheme, producing a taxonomy of 22 themes and 113 subthemes. KNN classification (via ChromaDB) then extended these thematic labels from the annotated sample to the 4.9M deduplicated posts, which made the cross-community overlap analysis possible.
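As a rough illustration of the label-extension step, the sketch below assigns a theme to a new embedding by majority vote among its nearest labelled exemplars. It swaps ChromaDB for a brute-force search in plain Python, and the cosine similarity measure, the value of `k`, the two-dimensional toy vectors and the theme names are all assumptions rather than details from the project.

```python
import math
from collections import Counter


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def knn_label(query, exemplars, k=3):
    """Return the majority label among the k exemplars closest to `query`.

    `exemplars` is a list of (embedding, label) pairs. On a tied vote the
    label seen first among the nearest neighbours wins.
    """
    ranked = sorted(
        exemplars,
        key=lambda e: cosine_similarity(query, e[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical annotated exemplars (real embeddings would be much longer).
exemplars = [
    ([1.0, 0.0], "theme_a"),
    ([0.9, 0.1], "theme_a"),
    ([0.0, 1.0], "theme_b"),
    ([0.1, 0.9], "theme_b"),
]

label = knn_label([0.8, 0.2], exemplars, k=3)
```

A vector store plays the role of the `sorted` call here: it returns the nearest exemplars without comparing each post against every exemplar, which is what makes the approach workable on millions of posts.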

Outcomes

Results & Impact

Multilingual topic modelling generally produced coherent, meaningful clusters across languages. A subset of clusters required further attention: some formed around language-specific artefacts (e.g. driven by a single German term rather than a broader concept), others were too dense to be useful as-is and needed unpacking, and others overlapped in ways better represented as a single consolidated theme. Contrastive learning provided a mechanism to reorient these clusters and improve semantic coherence.

KNN classification used the annotated topic model clusters as labelled exemplars: each post in the corpus to be classified (4.9M of the 7.7M collected) was matched against these exemplars to assign thematic labels at scale, achieving an F1 score of 0.87. The granular annotation scheme provided a well-organised tracking framework across three language teams and 162 clusters, with entropy-based ambiguity metrics used to monitor progress and prioritise review.
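The exact ambiguity metric is not specified here, so the following is only one plausible reading: Shannon entropy over the labels annotators assigned to sampled messages in a cluster, with higher entropy meaning more disagreement. The function name, cluster ids and labels are hypothetical.

```python
import math
from collections import Counter


def label_entropy(labels):
    """Shannon entropy (in bits) of a list of annotation labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    # Sum p * log2(1 / p) over the observed label proportions.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# Hypothetical annotations: labels given to sampled messages per cluster.
clusters = {
    "c1": ["x", "x", "x", "x"],  # unanimous
    "c2": ["x", "y", "x", "y"],  # evenly split
    "c3": ["x", "x", "x", "y"],  # mostly agreed
}

# Review the most ambiguous clusters first.
review_queue = sorted(
    clusters,
    key=lambda c: label_entropy(clusters[c]),
    reverse=True,
)
```

Under this reading a unanimous cluster scores 0 bits and an even two-way split scores 1 bit, so sorting by entropy puts the clusters most in need of discussion at the front of the queue.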

Analytical findings are published in the public report — see link below.

Tech Stack

Python, paraphrase-multilingual-mpnet-base-v2, UMAP, HDBSCAN, ChromaDB, KNN, Vaex, Parquet, Streamlit, Pandas