
Context

The Data Layer Behind the Analysis

Every topic model, NER extraction, and influence network analysis starts with data. For broadcast and social media monitoring projects, that data has to arrive continuously and reliably — any gap in collection creates a blind spot in the analysis, and retrospective collection is often impossible once the window has passed.

This work involved owning the full collection infrastructure: designing and maintaining queries across multiple platforms, ingesting data into a centralised data warehouse on a daily basis, managing quotas, and keeping pipelines healthy as APIs changed, platforms deprecated features, and project requirements evolved.

It is the layer that is invisible when it works and catastrophic when it doesn't — and the reason downstream analysis could proceed with confidence in the completeness of the data.

Operations

What Daily Management Involves

Collection pipelines require ongoing management beyond the initial setup. Each platform has rate limits, daily quotas, and evolving API behaviour that requires active monitoring. Key responsibilities included:

  • Quota management — ensuring daily collection targets are met without exceeding platform limits, balancing coverage across sources when quota is shared
  • Blacklist and allowlist curation — iteratively identifying and filtering noise sources (spam accounts, irrelevant channels, bot networks) while ensuring legitimate sources remain in scope
  • Query maintenance — updating boolean queries and keyword lists as project scope evolves or new topics emerge, without breaking historical continuity
  • API change management — adapting pipelines as platforms deprecate features or change access models (notably the CrowdTangle shutdown and Twitter API pricing restructure)
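
The blocklist and allowlist curation described above can be sketched as a small filter. This is a minimal illustration, not the project's actual implementation: the `SourceFilter` class, the `filter_posts` helper, and the `source` field name are all hypothetical, and the rule that the allowlist takes precedence is an assumption chosen so that a legitimate source is never dropped by an over-broad block entry.

```python
from dataclasses import dataclass, field


@dataclass
class SourceFilter:
    """Decide whether a source stays in scope (hypothetical helper)."""
    allowlist: set = field(default_factory=set)
    blocklist: set = field(default_factory=set)

    def __post_init__(self):
        # Normalise once so lookups are case- and whitespace-insensitive.
        self.allowlist = {s.strip().lower() for s in self.allowlist}
        self.blocklist = {s.strip().lower() for s in self.blocklist}

    def in_scope(self, source_id: str) -> bool:
        key = source_id.strip().lower()
        # Assumption: the allowlist wins over the blocklist.
        if key in self.allowlist:
            return True
        return key not in self.blocklist


def filter_posts(posts, source_filter):
    """Keep only posts whose 'source' field (assumed name) passes the filter."""
    return [p for p in posts if source_filter.in_scope(p["source"])]
```

Keeping both lists as plain sets makes each curation round a data change rather than a code change, which suits the iterative review the bullet describes.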

Collection Platforms

Collection ran through five distinct channels, each with its own API behaviour, data format, and operational characteristics. All pipelines feed into a centralised data warehouse for storage, deduplication, and downstream processing.

| Platform | Access method | What it provides | Notes |
|---|---|---|---|
| Brandwatch | Platform queries + downloads | Twitter/X, Facebook, Instagram: keyword and boolean query results at scale | Primary collection tool; daily download management and ingestion into central warehouse |
| Telegram API | Direct API | Channel and group posts, metadata, forwarding chains | Critical for monitoring influence operations and broadcast channels not covered by social platforms |
| YouTube Data API | Direct API | Video metadata, transcripts, comments, channel data | Quota-limited; daily unit budget management required to maintain continuous coverage |
| Twitter / X API | Direct API | Tweets, user metadata, engagement, network data | Used directly alongside Brandwatch; significantly impacted by the 2023 API restructure and pricing changes |
| CrowdTangle | Direct API | Facebook and Instagram public content, engagement metrics, page and group data | Meta deprecated CrowdTangle in 2024; pipelines required migration ahead of shutdown |
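
Daily unit budgeting on a quota-limited API can be illustrated with a proportional split across sources. This is a sketch under stated assumptions rather than the method actually used: `allocate_quota` and `calls_affordable` are hypothetical names, the weights are invented priorities, and the figures in the usage note (a 10,000-unit daily budget and a 100-unit search call) should be checked against the platform's current quota documentation.

```python
def allocate_quota(daily_units: int, weights: dict) -> dict:
    """Split a daily unit budget across sources in proportion to their weights.

    Uses the largest-remainder method so every share is a whole number of
    units and the shares sum exactly to the budget.
    """
    if daily_units < 0:
        raise ValueError("daily_units must be non-negative")
    if any(w < 0 for w in weights.values()):
        raise ValueError("weights must be non-negative")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")

    exact = {name: daily_units * w / total for name, w in weights.items()}
    shares = {name: int(value) for name, value in exact.items()}
    leftover = daily_units - sum(shares.values())

    # Hand the remaining units to the sources with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda n: exact[n] - shares[n], reverse=True)
    for name in by_remainder[:leftover]:
        shares[name] += 1
    return shares


def calls_affordable(units: int, cost_per_call: int) -> int:
    """Number of whole API calls a unit share can pay for."""
    if cost_per_call <= 0:
        raise ValueError("cost_per_call must be positive")
    return units // cost_per_call
```

For example, `allocate_quota(10000, {"a": 3, "b": 2, "c": 1})` splits the budget 5000 / 3333 / 1667, and `calls_affordable(1667, 100)` shows that the smallest share pays for 16 calls at 100 units each.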

Pipeline Architecture

All collection pipelines converge on a centralised data warehouse that serves as the single source of truth for downstream analysis. The warehouse handles deduplication across sources, normalises data formats, and provides the structured interface that NLP pipelines consume.
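
The normalisation and cross-source deduplication described above can be sketched as follows. The `Record` schema, the raw field names (`platform`, `id`, `author`, `text`, `created_at`), and the choice of `(platform, post_id)` as the deduplication key are all assumptions made for illustration; the source does not specify the warehouse schema. The key reflects the case where the same tweet arrives through both Brandwatch and the Twitter API.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Record:
    """Common schema for one post (hypothetical field set)."""
    platform: str
    post_id: str
    author: str
    text: str
    created_at: datetime
    collected_via: str


def normalise(raw: dict, collected_via: str) -> Record:
    """Map one raw item onto the common schema (raw keys are assumed)."""
    # Expects ISO 8601 with a numeric offset, e.g. 2023-05-01T10:00:00+00:00.
    created = datetime.fromisoformat(raw["created_at"])
    if created.tzinfo is None:
        # Assumption: timestamps without an offset are already in UTC.
        created = created.replace(tzinfo=timezone.utc)
    return Record(
        platform=raw["platform"].strip().lower(),
        post_id=str(raw["id"]),
        author=raw.get("author", ""),
        text=raw.get("text", ""),
        created_at=created.astimezone(timezone.utc),
        collected_via=collected_via,
    )


def deduplicate(records):
    """Keep the first record seen for each (platform, post_id) pair."""
    seen = {}
    for rec in records:
        seen.setdefault((rec.platform, rec.post_id), rec)
    return list(seen.values())
```

Normalising before deduplicating matters here: the key only matches across channels once platform names and ID types have been made consistent.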

Stage 1: Collection

  • Brandwatch downloads
  • Telegram API
  • YouTube API
  • Twitter API
  • CrowdTangle

Stage 2: Centralised Warehouse

  • Ingestion & deduplication
  • Format normalisation
  • Daily quota tracking
  • Source management

Stage 3: Downstream

  • Topic modelling
  • NER extraction
  • Network analysis
  • Client reporting

Cross-platform Capabilities

Two operational patterns apply consistently across all collection APIs, regardless of platform.

| Capability | What it means | Why it matters |
|---|---|---|
| Cron scheduling | Automated, time-triggered pipeline runs: daily collection across all platforms executed without manual intervention | Continuous collection requires no human touch for routine runs; failures surface as gaps in the warehouse rather than missed collections going unnoticed |
| Backsearch | Historical data retrieval: querying each platform's archive to retrieve data from before a pipeline was active, or to fill gaps after an outage | New project queries can be initialised with months of historical context; quota-limited platforms (YouTube, Twitter) require careful backfill planning to avoid exhausting limits on one-time bulk requests |
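
The two patterns meet when a scheduled run fails: the failure shows up as missing days in the warehouse, and those days then need a backfill that fits within spare quota. A minimal sketch of that planning step follows. The function names and the flat per-day unit cost are hypothetical simplifications; real backfill costs vary by platform and query volume.

```python
from datetime import date, timedelta


def find_gaps(collected_days: set, start: date, end: date) -> list:
    """Return every day in [start, end] that has no collected data."""
    n_days = (end - start).days + 1
    window = (start + timedelta(days=i) for i in range(n_days))
    return [day for day in window if day not in collected_days]


def plan_backfill(missing_days, units_per_day_fetch: int, spare_units_per_run: int):
    """Group missing days into runs that each fit within the spare quota.

    units_per_day_fetch: assumed flat cost of backfilling one calendar day.
    spare_units_per_run: quota left over after the routine daily collection.
    """
    if units_per_day_fetch <= 0:
        raise ValueError("units_per_day_fetch must be positive")
    per_run = spare_units_per_run // units_per_day_fetch
    if per_run < 1:
        raise ValueError("spare quota cannot cover even one day of backfill")
    days = sorted(missing_days)
    return [days[i:i + per_run] for i in range(0, len(days), per_run)]
```

Spreading the backfill across several runs keeps the routine daily collection funded, which is the constraint the table describes for quota-limited platforms.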

Tech Stack

Brandwatch, Telegram API, YouTube Data API, Twitter / X API, CrowdTangle, Python, REST APIs, cron scheduling, backsearch, pipeline management, data engineering