Social Media Data Collection

Context

The Data Layer Behind the Analysis

Every topic model, NER extraction, and influence network analysis starts with data. For broadcast and social media monitoring projects, that data has to arrive continuously and reliably — any gap in collection creates a blind spot in the analysis, and retrospective collection is often impossible once the window has passed.

This work involved owning the full collection infrastructure: designing and maintaining queries across multiple platforms, ingesting data into a centralised data warehouse on a daily basis, managing quotas, and keeping pipelines healthy as APIs changed, platforms deprecated features, and project requirements evolved.

It is the layer that is invisible when it works and catastrophic when it doesn't — and the reason downstream analysis could proceed with confidence in the completeness of the data.

Operations

What Daily Management Involves

Collection pipelines require ongoing management beyond the initial setup. Each platform has its own rate limits and daily quotas, and its API behaviour changes over time, so all three need active monitoring. Key responsibilities included:

Quota management — ensuring daily collection targets are met without exceeding platform limits, balancing coverage across sources when quota is shared
Blacklist and allowlist curation — iteratively identifying and filtering noise sources (spam accounts, irrelevant channels, bot networks) while ensuring legitimate sources remain in scope
Query maintenance — updating boolean queries and keyword lists as project scope evolves or new topics emerge, without breaking historical continuity
API change management — adapting pipelines as platforms deprecate features or change access models (notably the CrowdTangle shutdown and Twitter API pricing restructure)
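
The allowlist-over-blocklist precedence behind source curation can be sketched roughly as follows. This is a minimal illustration, not the project's actual filter: the `SourceFilter` class, the `source_id` field name, and exact-match comparison on lowercased IDs are all assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SourceFilter:
    """Hypothetical filter holding exact-match sets of source IDs."""
    allowed: set[str] = field(default_factory=set)
    blocked: set[str] = field(default_factory=set)

    def __post_init__(self) -> None:
        # Store IDs in one canonical form so lookups are case-insensitive.
        self.allowed = {s.strip().lower() for s in self.allowed}
        self.blocked = {s.strip().lower() for s in self.blocked}

    def keep(self, source_id: str) -> bool:
        key = source_id.strip().lower()
        if key in self.allowed:
            # An explicit allow wins, so a legitimate source stays in scope
            # even if it was also swept into the blocklist.
            return True
        return key not in self.blocked


def filter_posts(posts: list[dict], flt: SourceFilter) -> list[dict]:
    """Drop posts whose 'source_id' (an assumed field name) is blocked."""
    return [p for p in posts if flt.keep(p["source_id"])]
```

Letting the allowlist take precedence means an over-broad blocklist entry cannot silently remove a source the project depends on; whether that is the right default depends on how the lists are curated.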

Collection Platforms

Collection ran across five distinct channels, each with its own API behaviour, data format, and operational characteristics. All pipelines feed into a centralised data warehouse for storage, deduplication, and downstream processing.

Platform	Access method	What it provides	Notes
Brandwatch	Platform queries + downloads	Twitter/X, Facebook, Instagram — keyword and boolean query results at scale	Primary collection tool; daily download management and ingestion into central warehouse
Telegram API	Direct API	Channel and group posts, metadata, forwarding chains	Critical for monitoring influence operations and broadcast channels not covered by social platforms
YouTube Data API	Direct API	Video metadata, transcripts, comments, channel data	Quota-limited; daily unit budget management required to maintain continuous coverage
Twitter / X API	Direct API	Tweets, user metadata, engagement, network data	Used directly alongside Brandwatch; significantly impacted by the 2023 API restructure and pricing changes
CrowdTangle	Direct API	Facebook and Instagram public content, engagement metrics, page and group data	Meta deprecated CrowdTangle in 2024; pipelines required migration ahead of shutdown
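
As an illustration of the daily unit budgeting mentioned for the YouTube Data API, the sketch below estimates the cost of a planned day of calls. The per-method costs, the 10,000-unit default allocation, and the 50-ID batch size are assumptions based on the publicly documented defaults and should be checked against the current quota table; the function names are hypothetical.

```python
import math

# Assumed values: verify against the current YouTube Data API quota documentation.
DAILY_UNITS = 10_000  # default daily allocation per project
COST = {"search.list": 100, "videos.list": 1, "commentThreads.list": 1}
IDS_PER_VIDEOS_CALL = 50  # videos.list accepts a batch of IDs per request


def daily_units(search_pages: int, video_ids: int, comment_pages: int) -> int:
    """Estimate the units consumed by one day's planned calls."""
    video_calls = math.ceil(video_ids / IDS_PER_VIDEOS_CALL)
    return (
        search_pages * COST["search.list"]
        + video_calls * COST["videos.list"]
        + comment_pages * COST["commentThreads.list"]
    )


def fits_budget(units: int, reserve: int = 0) -> bool:
    """True if the plan fits the daily allocation, keeping `reserve` units spare."""
    return units <= DAILY_UNITS - reserve
```

Under these assumed costs, search pages dominate the budget, so a plan is usually tightened by reducing search calls before anything else.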

Pipeline Architecture

All collection pipelines converge on a centralised data warehouse that serves as the single source of truth for downstream analysis. The warehouse handles deduplication across sources, normalises data formats, and provides the structured interface that NLP pipelines consume.
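
A rough sketch of the normalise-then-deduplicate step, assuming that the same post can arrive through more than one channel (for example a tweet collected via both Brandwatch and the Twitter API). The `Post` schema, the raw field names, and the choice of `(platform, post_id)` as the deduplication key are assumptions for illustration, not the warehouse's actual design.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Post:
    """Hypothetical common schema for records from any collection channel."""
    platform: str       # where the post was published, e.g. "twitter"
    post_id: str        # the platform's own identifier
    text: str
    collected_via: str  # which pipeline delivered the record


def normalise(raw: dict, collected_via: str) -> Post:
    """Map a raw record onto the common schema (raw field names are assumed)."""
    return Post(
        platform=raw["platform"].strip().lower(),
        post_id=str(raw["id"]),
        text=" ".join(raw.get("text", "").split()),  # collapse whitespace
        collected_via=collected_via,
    )


def deduplicate(posts: list[Post]) -> list[Post]:
    """Keep the first record seen for each (platform, post_id) pair."""
    seen: set[tuple[str, str]] = set()
    unique: list[Post] = []
    for post in posts:
        key = (post.platform, post.post_id)
        if key not in seen:
            seen.add(key)
            unique.append(post)
    return unique
```

Keying on the platform's own identifier instead of the text avoids merging distinct posts that happen to share wording, such as reposts or copied messages.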

Collection

Brandwatch downloads
Telegram API
YouTube API
Twitter API
CrowdTangle

Centralised Warehouse

Ingestion & deduplication
Format normalisation
Daily quota tracking
Source management

Downstream

Topic modelling
NER extraction
Network analysis
Client reporting

Cross-platform Capabilities

Two operational patterns apply consistently across all collection APIs, regardless of platform.

Capability	What it means	Why it matters
Cron scheduling	Automated, time-triggered pipeline runs: daily collection across all platforms executes without manual intervention	Routine runs need no human involvement, and a failed run shows up as a visible gap in the warehouse instead of passing unnoticed
Backsearch	Historical data retrieval — querying each platform's archive to retrieve data from before a pipeline was active, or to fill gaps after an outage	New project queries can be initialised with months of historical context; quota-limited platforms (YouTube, Twitter) require careful backfill planning to avoid exhausting limits on one-time bulk requests
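
Since failed scheduled runs show up as missing days, a backsearch needs to know which date ranges to request. The sketch below finds those ranges, assuming the warehouse can report the set of dates that already hold data for a query; `find_gaps` and its inputs are hypothetical names for illustration.

```python
from __future__ import annotations

from datetime import date, timedelta


def find_gaps(collected: set[date], start: date, end: date) -> list[tuple[date, date]]:
    """Return inclusive (first_missing, last_missing) ranges within [start, end]."""
    gaps: list[tuple[date, date]] = []
    gap_start: date | None = None
    day = start
    while day <= end:
        if day not in collected:
            if gap_start is None:
                gap_start = day  # a new run of missing days begins here
        elif gap_start is not None:
            # Data resumes today, so the gap ended yesterday.
            gaps.append((gap_start, day - timedelta(days=1)))
            gap_start = None
        day += timedelta(days=1)
    if gap_start is not None:
        gaps.append((gap_start, end))  # gap runs up to the end of the window
    return gaps
```

Each returned range can then be submitted as one backsearch request, or split into smaller requests on quota-limited platforms.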

Tech Stack

Brandwatch, Telegram API, YouTube Data API, Twitter / X API, CrowdTangle, Python, REST APIs, cron scheduling, backsearch, pipeline management, data engineering