← Back to Methods

Context

From Raw STT to Structured Data

Broadcast media monitoring at scale produces a continuous stream of speech-to-text (STT) transcripts — one JSON file per channel per broadcast window, across multiple languages. The raw output from a provider like AWS Transcribe includes the full transcript text, word-level timing, confidence scores, and speaker diarisation labels, but it is not in a form amenable to analysis: speakers are identified only by generic labels, word items are not yet attributed to turns, and there is no structure that maps entities to speakers or detects when the same broadcast appears more than once.

This pipeline was developed across several broadcast monitoring projects, initially for Arabic-language channels and later extended to Indonesian and English. The same core processing logic applies regardless of language; language-specific handling (such as Arabic prefix merging) is layered on top of the shared base.

Parsing

Diarisation and Speaker Attribution

The STT JSON format separates word-level items (with start/end timestamps and confidence scores) from speaker label segments (which identify who was speaking across each time window). Attributing words to speakers requires aligning these two structures by timestamp — a word item is assigned to the speaker segment that contains its time range.

Each speaker turn is then reconstructed as a sentence: the word items for that segment are joined in sequence, punctuation is inserted where present, and the result is stored alongside channel name, filename, broadcast time window, speaker label, and segment start/end times. This produces one row per speaker turn — a format that downstream NER extraction and dashboard filtering can operate on directly.

For Arabic, AWS Transcribe frequently introduces a space between proclitic prefixes (و, ف, ب, ل, ال, and compound forms like وال, فال, بال) and the word they attach to. A regex merge step removes these spaces before the sentence is finalised, preventing incorrect tokenisation in downstream NLP.

Extraction

Named Entity Recognition

Entity extraction runs over the structured speaker sentence data using a HuggingFace token classification model. The model is run in batch mode over the speaker sentences for each channel, with a configurable batch size and maximum sequence length. Outputs are decoded using the model's label vocabulary (B-I-O tagging scheme), with subword tokens merged back into full entity spans.

Post-processing collapses B- and I- tagged tokens into a single entity record with a text string and label. An optional label remapping allows project-specific entity type normalisation. The final output is an entities column on the speaker sentence DataFrame, with each row holding a list of extracted entities — and a parallel entity_schema column giving a dict of entity type to unique values for that sentence.

NER extraction runs as a separate processing step and is model-agnostic: any HuggingFace token classification model can be used, making it straightforward to swap in a language-appropriate NER model per channel.

Quality

Duplicate Detection

Broadcast monitoring data frequently contains near-duplicate transcripts: the same broadcast captured from multiple sources, overlapping recording windows, or repeated segments. Deduplication needs to be fuzzy rather than exact — minor transcription differences or partial overlaps mean that exact hashing on the full transcript text misses most true duplicates.

The duplicate detector uses rapidfuzz token-based string similarity to compare transcripts pairwise within a channel. Any pair above a configurable similarity threshold (default 80%) is flagged as a candidate duplicate. The result is a report that annotates each transcript with the list of transcripts it is near-identical to, which the analyst can use to decide whether to deduplicate before downstream processing.

A content hash (MD5 of the full transcript text) is generated at parse time, providing a fast exact-match signal before the more expensive pairwise comparison runs.

Pipeline

Four sequential stages. Parse and Extract run on the raw data directory; Detect runs on the parse outputs; Explore reads the processed outputs from the dashboard.

Stage Class / Module What it does
Parse Channel / Transcript / Speaker Reads .stt.json files, aligns word items to speaker segments by timestamp, reconstructs speaker sentences with Arabic prefix merging, writes structured CSVs to outputs/<ChannelName>/.
Extract NamedEntityExtractions Runs a HuggingFace token classification model over speaker sentences in batch. Merges B-I-O spans into entity records. Adds entities and entity_schema columns to the parsed DataFrame.
Detect DuplicateDetector Pairwise rapidfuzz similarity across all transcripts within a channel. Flags pairs above threshold as candidates. Returns a report DataFrame with a duplicates column listing near-identical transcripts per row.
Explore dashboard/Load.py Streamlit dashboard. Reads parsed CSVs from outputs/. Provides per-channel/transcript/speaker views and entity filtering. Start with streamlit run dashboard/Load.py.

Input format.stt.json files produced by AWS Transcribe (or a compatible provider), with filenames encoding the broadcast window: "2024-06-11 07-00-00-000 to 2024-06-11 08-00-00-000.stt.json". Channel subfolders under data/ determine how transcripts are grouped.

Dashboard

A three-page Streamlit dashboard (streamlit run dashboard/Load.py) built on an MVC architecture. Reads processed CSVs from outputs/ — run the parse stage first to populate them. Supports Iraqi Arabic (ar-IQ), Indonesian (id-ID), and English (en-GB), as well as any other language covered by the chosen NER model.

Page What it shows
Load Top-level metrics: total channels, transcripts, sentences, and average words per sentence. Per-channel expanders with individual transcript/sentence/speaker counts.
Speaker Data Select a channel, then a specific transcript. Shows all diarised speaker turns in sequence with speaker labels. A statistics panel shows the sentence and word count breakdown per speaker.
NER Analysis Multi-channel, multi-transcript, and multi-speaker selection. Entity type filter (PER, ORG, LOC, etc.) and free-text search. Two tabs: top entity frequency distribution across the selection; speaker data with in-line entity highlighting. Populated once NER extraction has been run.
Dashboard Load page — 3 channels, 35 sentences, channel summaries

Load — channel metrics overview

Speaker Data page — diarised speaker segments for an English broadcast

Speaker Data — diarised turns per transcript

NER Analysis page — channel/transcript/entity type filters

NER Analysis — entity filters across channels

Languages

The dummy dataset in the repository covers three language tracks. Arabic includes prefix merging; Indonesian and English run through the standard pipeline without language-specific post-processing.

Language STT lang code Notes
Iraqi Arabic ar-IQ Regex prefix merging applied (و، ف، ب، ل، ال and compound forms)
Indonesian id-ID Standard pipeline; NER model selected per deployment
English en-GB Standard pipeline; NER model selected per deployment

Tech Stack

Python HuggingFace Transformers PyTorch pandas rapidfuzz Streamlit AWS Transcribe (STT format) Arabic NLP Speaker Diarisation Named Entity Recognition