Data Engineering
Social Media Data Collection
Managing continuous, multi-platform collection pipelines across Brandwatch, Telegram, YouTube, Twitter, and CrowdTangle — feeding a centralised data warehouse for downstream NLP analysis
Developed at CASM Technology · Senior Data Scientist
Context
The Data Layer Behind the Analysis
Every topic model, NER extraction, and influence network analysis starts with data. For broadcast and social media monitoring projects, that data has to arrive continuously and reliably — any gap in collection creates a blind spot in the analysis, and retrospective collection is often impossible once the window has passed.
This work involved owning the full collection infrastructure: designing and maintaining queries across multiple platforms, ingesting data into a centralised data warehouse on a daily basis, managing quotas, and keeping pipelines healthy as APIs changed, platforms deprecated features, and project requirements evolved.
It is the layer that is invisible when it works and catastrophic when it doesn't — and the reason downstream analysis could proceed with confidence in the completeness of the data.
Operations
What Daily Management Involves
Collection pipelines require ongoing management beyond the initial setup. Each platform has rate limits, daily quotas, and evolving API behaviour that requires active monitoring. Key responsibilities included:
- Quota management — ensuring daily collection targets are met without exceeding platform limits, balancing coverage across sources when quota is shared
- Blacklist and allowlist curation — iteratively identifying and filtering noise sources (spam accounts, irrelevant channels, bot networks) while ensuring legitimate sources remain in scope
- Query maintenance — updating boolean queries and keyword lists as project scope evolves or new topics emerge, without breaking historical continuity
- API change management — adapting pipelines as platforms deprecate features or change access models (notably the CrowdTangle shutdown and Twitter API pricing restructure)
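As an illustration of the blacklist and allowlist curation described above, here is a minimal sketch of a source filter. The class name, the `source_id` field, and the rule that an allowlist entry overrides a blacklist entry are all assumptions made for this example, not a description of the production pipeline.

```python
from dataclasses import dataclass, field


@dataclass
class SourceFilter:
    """Hypothetical filter: allowlisted sources always pass, blacklisted ones are dropped."""
    allowlist: set = field(default_factory=set)
    blocklist: set = field(default_factory=set)

    def __post_init__(self):
        # Normalise identifiers once so lookups are case- and whitespace-insensitive.
        self.allowlist = {s.strip().lower() for s in self.allowlist}
        self.blocklist = {s.strip().lower() for s in self.blocklist}

    def keep(self, source_id: str) -> bool:
        key = source_id.strip().lower()
        if key in self.allowlist:
            # Assumption: the allowlist wins if a source appears on both lists.
            return True
        return key not in self.blocklist


def filter_posts(posts, source_filter):
    """Return only the posts whose source passes the filter, preserving order."""
    return [p for p in posts if source_filter.keep(p["source_id"])]


source_filter = SourceFilter(
    allowlist={"trusted_news"},
    blocklist={"spam_bot_1", "trusted_news"},
)
posts = [
    {"source_id": "Spam_Bot_1", "text": "buy now"},
    {"source_id": "trusted_news", "text": "headline"},
    {"source_id": "unknown_channel", "text": "post"},
]
kept = filter_posts(posts, source_filter)
```

Letting the allowlist take precedence is one way to protect legitimate sources from an over-broad blacklist entry; a real deployment might prefer the opposite rule.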
Collection Platforms
The work covered five distinct collection channels, each with its own access behaviour, data format, and operational characteristics. All pipelines feed into a centralised data warehouse for storage, deduplication, and downstream processing.
| Platform | Access method | What it provides | Notes |
|---|---|---|---|
| Brandwatch | Platform queries + downloads | Twitter/X, Facebook, Instagram — keyword and boolean query results at scale | Primary collection tool; daily download management and ingestion into central warehouse |
| Telegram API | Direct API | Channel and group posts, metadata, forwarding chains | Critical for monitoring influence operations and broadcast channels not covered by social platforms |
| YouTube Data API | Direct API | Video metadata, transcripts, comments, channel data | Quota-limited; daily unit budget management required to maintain continuous coverage |
| Twitter / X API | Direct API | Tweets, user metadata, engagement, network data | Used directly alongside Brandwatch; significantly impacted by the 2023 API restructure and pricing changes |
| CrowdTangle | Direct API | Facebook and Instagram public content, engagement metrics, page and group data | Meta deprecated CrowdTangle in 2024; pipelines required migration ahead of shutdown |
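The daily unit budget mentioned for quota-limited platforms can be sketched as a small tracker that refuses a request once the day's allowance would be exceeded. The class, the exception, and the numbers below are illustrative assumptions; actual per-request costs and limits come from each platform's own quota documentation.

```python
class QuotaExhausted(Exception):
    """Raised when a request would push spending past the daily limit."""


class DailyQuotaBudget:
    """Hypothetical in-memory tracker for one platform's daily unit budget."""

    def __init__(self, daily_limit: int):
        self.daily_limit = daily_limit
        self.spent = 0

    @property
    def remaining(self) -> int:
        return self.daily_limit - self.spent

    def can_afford(self, cost: int) -> bool:
        return self.spent + cost <= self.daily_limit

    def charge(self, cost: int) -> None:
        # Check before spending so a rejected request leaves the budget unchanged.
        if not self.can_afford(cost):
            raise QuotaExhausted(f"need {cost} units, only {self.remaining} left")
        self.spent += cost

    def reset(self) -> None:
        # Intended to be called once per day when the platform's quota window rolls over.
        self.spent = 0


# Illustrative numbers only: a 250-unit budget and two 100-unit requests.
budget = DailyQuotaBudget(daily_limit=250)
budget.charge(100)
budget.charge(100)
```

A persistent version would store `spent` in the warehouse rather than in memory, so that a restarted pipeline does not forget what it has already used.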
Pipeline Architecture
All collection pipelines converge on a centralised data warehouse that serves as the single source of truth for downstream analysis. The warehouse handles deduplication across sources, normalises data formats, and provides the structured interface that NLP pipelines consume.
- Collection: Brandwatch downloads, Telegram API, YouTube API, Twitter API, CrowdTangle
- Centralised Warehouse: ingestion & deduplication, format normalisation, daily quota tracking, source management
- Downstream: topic modelling, NER extraction, network analysis, client reporting
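The warehouse's normalisation and cross-source deduplication steps could look roughly like the sketch below. It assumes the same post can arrive through two pipelines (for example a tweet collected via both a Brandwatch download and the Twitter API) and that a post is uniquely identified by its originating platform plus its native ID. The `FIELD_MAPS` entries and raw field names are hypothetical placeholders, not the actual export or API schemas.

```python
# Hypothetical mapping from each pipeline's raw field names to a shared schema.
FIELD_MAPS = {
    "brandwatch_export": {"native_id": "post_id", "text": "full_text"},
    "twitter_api": {"native_id": "id", "text": "text"},
}


def normalise(collected_via: str, platform: str, raw: dict) -> dict:
    """Map a pipeline-specific record onto the shared schema."""
    mapping = FIELD_MAPS[collected_via]
    return {
        "platform": platform,            # where the content was posted
        "collected_via": collected_via,  # which pipeline retrieved it
        "native_id": str(raw[mapping["native_id"]]),
        "text": raw.get(mapping["text"], ""),
    }


def deduplicate(records):
    """Keep the first record seen for each (platform, native_id) pair."""
    seen = set()
    unique = []
    for record in records:
        key = (record["platform"], record["native_id"])
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


records = [
    normalise("brandwatch_export", "twitter", {"post_id": 42, "full_text": "hello"}),
    normalise("twitter_api", "twitter", {"id": "42", "text": "hello"}),
    normalise("twitter_api", "twitter", {"id": "43", "text": "other"}),
]
unique_records = deduplicate(records)
```

Casting the native ID to a string means an integer ID from one pipeline and a string ID from another still collide as intended. Keeping the first record seen is an arbitrary choice for this sketch; a real pipeline might instead prefer whichever source carries richer metadata.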
Cross-platform Capabilities
Two operational patterns apply consistently across all collection pipelines, regardless of platform.
| Capability | What it means | Why it matters |
|---|---|---|
| Cron scheduling | Automated, time-triggered pipeline runs: daily collection across all platforms executed without manual intervention | Routine runs need no human involvement; a failed run shows up as a visible gap in the warehouse instead of going unnoticed |
| Backsearch | Historical data retrieval — querying each platform's archive to retrieve data from before a pipeline was active, or to fill gaps after an outage | New project queries can be initialised with months of historical context; quota-limited platforms (YouTube, Twitter) require careful backfill planning to avoid exhausting limits on one-time bulk requests |
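The backfill planning noted in the table can be illustrated with a short sketch that spreads a historical date range over several scheduled runs, using only the quota left over after routine daily collection. The function name, its parameters, and the assumption that each historical day costs a fixed number of units are all hypothetical simplifications.

```python
from datetime import date, timedelta


def plan_backfill(start, end, cost_per_day, daily_quota, reserved_for_live, first_run):
    """Split the range [start, end] into chunks, one chunk per run date.

    Each run may spend only the quota not reserved for routine live collection.
    Returns a dict mapping run date -> (chunk_start, chunk_end), both inclusive.
    """
    spare = daily_quota - reserved_for_live
    if spare < cost_per_day:
        raise ValueError("spare quota cannot cover even one historical day per run")

    days_per_run = spare // cost_per_day
    plan = {}
    cursor, run_date = start, first_run
    while cursor <= end:
        chunk_end = min(cursor + timedelta(days=days_per_run - 1), end)
        plan[run_date] = (cursor, chunk_end)
        cursor = chunk_end + timedelta(days=1)
        run_date += timedelta(days=1)
    return plan


# Illustrative numbers only: 10,000 units per day, 4,000 reserved for live
# collection, and an assumed 1,500 units per historical day.
plan = plan_backfill(
    start=date(2024, 1, 1),
    end=date(2024, 1, 10),
    cost_per_day=1500,
    daily_quota=10000,
    reserved_for_live=4000,
    first_run=date(2024, 3, 1),
)
for run_date, (chunk_start, chunk_end) in plan.items():
    print(run_date, "->", chunk_start, "to", chunk_end)
```

With these numbers the ten-day backfill is spread over three consecutive runs of four, four, and two historical days. Each planned chunk could then be picked up by the same daily scheduled run that handles routine collection.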