
The Project

Context & Objectives

Emotech builds voice AI systems for customer service. Its LLM-powered conversational engine, Emoflow, handles order queries, task resolution, delegation, and cross-lingual interaction across English and Arabic dialects. The system had three conversational flows when I joined. Manual testing was not producing reliable quality signals: bugs were reproduced inconsistently, quality variation across versions was invisible, and findings had no structured path to the right stakeholder.

I was brought in as Applied LLM Researcher — a multidisciplinary role covering LLM research, software development, QA design, and cross-team coordination between the product manager, engineering team, and TTS/STT specialists. My work covered the full evaluation lifecycle and a parallel architectural research track on Rasa CALM.

My Role

Applied LLM Researcher — Multidisciplinary

The engagement required operating across multiple teams simultaneously. I read product requirements and use cases from Confluence and Jira, worked directly with the PM to understand what needed testing and why, liaised with the engineering team on the technical feasibility of test designs, and supported the TTS/STT team with their own validation work on Arabic dialects and English.

On the engineering side, an initial prototype existed: GPT driving conversations with Emoflow over WebSocket. I built the complete library around it: the automated reporting system, versioned folder management, configuration layer, runner, and a parallel-processing WebSocket client built on daemon threads, which resolved a thread-sticking bug in the single-processing version. I also mapped the full Rasa CALM codebase to produce the architectural documentation that grounded the migration research.

Evaluation System

LLM-as-a-Judge Pipeline

The core system used GPT to both simulate user behaviour and evaluate chatbot responses in the same pipeline. A WebSocket client drove live conversations between GPT (acting as a user) and Emoflow — with GPT-generated prompts targeting specific failure modes across three test types:

  • Automation tests — design and flow compliance (did the conversation follow the intended path?)
  • Bug reproduction tests — design bugs (conversation design errors), logic bugs (flow execution errors), and quality issues — each routed to the right owner
  • Feature tests — whether newly implemented features behaved as specified

Each conversation was evaluated by GPT-4o-mini against structured behavioural criteria: unexpected behaviour (premature termination, unnecessary delegation, restarting), delegation necessity, and task resolution. Findings were classified by type and automatically routed to the relevant stakeholder — PM, developer, or manual testing queue. Results were logged as structured JSON and compiled into versioned summary reports and a live Google Sheet updated after each run cycle.
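A minimal sketch of the classify-and-route step, for illustration only. It assumes the judge model replies with a JSON object; the field names, the finding-type labels, and the mapping from finding type to owner are all hypothetical and are not taken from the production system.

```python
import json
from dataclasses import dataclass
from typing import Optional

# Assumed mapping from finding type to owner; illustrative only.
ROUTES = {
    "design_bug": "pm",
    "logic_bug": "developer",
    "quality_issue": "manual_testing_queue",
}


@dataclass
class Verdict:
    unexpected_behaviour: bool    # premature termination, restarting, ...
    unnecessary_delegation: bool  # delegated when it was not needed
    task_resolved: bool
    finding_type: Optional[str]   # None when the judge reports no finding


def parse_verdict(raw: str) -> Verdict:
    """Turn the judge model's JSON reply into a typed verdict."""
    data = json.loads(raw)
    return Verdict(
        unexpected_behaviour=bool(data["unexpected_behaviour"]),
        unnecessary_delegation=bool(data["unnecessary_delegation"]),
        task_resolved=bool(data["task_resolved"]),
        finding_type=data.get("finding_type"),
    )


def route(verdict: Verdict) -> Optional[str]:
    """Return the owner for a finding, or None when there is nothing to route."""
    if verdict.finding_type is None:
        return None
    # Unrecognised types go to manual review instead of being dropped.
    return ROUTES.get(verdict.finding_type, "manual_testing_queue")
```

Keeping the verdict typed means a malformed judge reply fails loudly at parse time instead of silently producing a mis-routed finding.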

Iteration

Evaluation Methodology: Three Stages

Stage 1: Fully automated. GPT handled both scenario generation and evaluation. Scoring was inconsistent, either too lenient or too strict, and edge cases with cross-test dependencies were not reliably caught. v0.1.0: 283 runs, 48% overall pass rate (72% automation, 28% bug, 0% feature).

Stage 2: Revised approach. Simulation tests were removed and the focus shifted to edge cases and bug isolation. Pass/fail logic became structured, requiring multiple failure conditions rather than a single binary judgment. Bug isolation became the priority: resolving one issue consistently exposed the next.

Stage 3: Hybrid manual/automated. Automated evaluation was disabled for most test cases, and manual validation was introduced for bug-related tests: examine conversation logs; identify hallucinations, flow execution mistakes, and intent classification errors; document them in a shared sheet; and log confirmed bugs in Jira. v0.2.6: 42 test cases, 330 runs, 14 test categories, all manually reviewed.

Category | Tests | What it covered
Design | 10 | Flow compliance, menu handling, greeting loops, phone number collection
Reliability | 12 | Emotional responses, frustration, passive aggression, interruption, memory, overlapping questions
Adversarial | 8 | Prompt injection, jailbreaking, security bypass, memory attacks, role-play, intent mixing
External testing | 110 runs | Real conversations from external testers fed directly into the framework
ASR / Human transcripts | 3 | Real voice pipeline output integrated as test input: ASR transcripts and human-reviewed transcripts
Malicious intent | 3 | Out-of-scope adversarial users, malicious prompts, intent confusion
Feature / Code bug | 3 | New features under validation; infrastructure-level failure handling
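The Stage 2 rule, where a run fails only when several failure conditions hold, can be sketched roughly as follows. The function name, the check names, and the default threshold of two are assumptions made for illustration; the source does not state the actual threshold.

```python
from typing import Dict, List, Tuple


def judge_run(checks: Dict[str, bool], min_failures: int = 2) -> Tuple[str, List[str]]:
    """Return ("pass" or "fail", names of the failure conditions observed).

    `checks` maps a check name to True when that failure condition was
    observed. A run fails only when at least `min_failures` conditions hold,
    so one noisy signal cannot flip the result on its own.
    """
    failed = sorted(name for name, observed in checks.items() if observed)
    status = "fail" if len(failed) >= min_failures else "pass"
    return status, failed
```

Returning the list of observed conditions alongside the status keeps a passing run with one flagged condition visible for manual review.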

Toolkit

Architecture & Engineering

The toolkit consisted of three primary components: a WebSocket client, a folder manager, and a runner/reporting layer.

The client handled real-time communication with Emoflow: generating GPT completions, sending them over WebSocket, and receiving and logging responses. I built both a single-processing version and a parallel-processing implementation using daemon threads; the latter allowed multiple conversations to run concurrently, significantly increasing throughput.
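A rough sketch of the daemon-thread pattern, not the production client. The `run_conversation` function is a hypothetical stand-in for one live WebSocket conversation, and the worker count and timeout are assumed values.

```python
import queue
import threading
from typing import Callable, Iterable, List


def run_conversation(test_id: str) -> dict:
    """Hypothetical stand-in for one live conversation with the chatbot."""
    return {"test_id": test_id, "turns": 3}


def run_parallel(
    test_ids: Iterable[str],
    worker: Callable[[str], dict] = run_conversation,
    max_workers: int = 4,
    timeout: float = 30.0,
) -> List[dict]:
    tasks: "queue.Queue[str]" = queue.Queue()
    for test_id in test_ids:
        tasks.put(test_id)

    results: List[dict] = []
    lock = threading.Lock()

    def loop() -> None:
        while True:
            try:
                test_id = tasks.get_nowait()
            except queue.Empty:
                return  # no work left for this thread
            try:
                outcome = worker(test_id)
            except Exception as exc:  # record the failure, keep the thread alive
                outcome = {"test_id": test_id, "error": str(exc)}
            with lock:
                results.append(outcome)

    # Daemon threads do not block interpreter exit if a conversation hangs.
    threads = [threading.Thread(target=loop, daemon=True) for _ in range(max_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join(timeout=timeout)  # bounded wait instead of waiting forever

    with lock:
        return sorted(results, key=lambda record: record["test_id"])
```

The bounded `join` together with `daemon=True` is one way to keep a stuck connection from stalling the whole run; whether the original fix worked this way is not stated in the source.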

The folder manager maintained a versioned hierarchy that tracked each Emoflow model version, flow version, test case, prompt, and individual run with timestamps. Any change in the chatbot configuration triggered automatic creation of a new version folder, ensuring full traceability across deployment cycles.
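One way such a hierarchy could be derived is sketched below. The directory layout, the function names, and the use of a configuration hash to detect changes are assumptions for illustration; the source does not describe how configuration changes were detected.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def config_fingerprint(config: dict) -> str:
    """Short, stable hash of the chatbot configuration."""
    canonical = json.dumps(config, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:8]


def build_run_path(
    root: Path,
    model_version: str,
    flow_version: str,
    config: dict,
    test_case: str,
    now: Optional[datetime] = None,
) -> Path:
    """Compose the folder for one run; does not touch the disk."""
    stamp = (now or datetime.now(timezone.utc)).strftime("%Y%m%dT%H%M%SZ")
    return (
        root
        / model_version
        / flow_version
        / f"config-{config_fingerprint(config)}"  # changes when the config changes
        / test_case
        / stamp
    )
```

A caller would create the folder with `path.mkdir(parents=True, exist_ok=True)` before writing logs. Because the fingerprint is computed from a key-sorted dump, the same configuration always maps to the same version folder regardless of key order.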

The runner managed execution based on a config file: which test cases to run, how many times, whether to persist results, and whether to evaluate. The reporting layer compiled all run results into daily summaries, per-test-case reports, version-level summaries, and a centralised Google Sheet.
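The config-driven loop can be sketched as below. `RunnerConfig`, its field names, and the callback signatures are hypothetical; they mirror the four switches named above (which test cases, how many times, whether to persist, whether to evaluate) rather than the real config file format.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class RunnerConfig:
    test_cases: Tuple[str, ...]
    repeats: int = 1       # how many times each test case runs
    persist: bool = True   # whether results are saved
    evaluate: bool = True  # whether results are evaluated


def execute(
    config: RunnerConfig,
    run_one: Callable[[str, int], dict],
    evaluate_one: Optional[Callable[[dict], str]] = None,
    save: Optional[Callable[[dict], None]] = None,
) -> List[dict]:
    """Run every configured test case `repeats` times."""
    results = []
    for case in config.test_cases:
        for attempt in range(1, config.repeats + 1):
            record = run_one(case, attempt)
            if config.evaluate and evaluate_one is not None:
                record["status"] = evaluate_one(record)
            if config.persist and save is not None:
                save(record)
            results.append(record)
    return results


def summarise(results: List[dict]) -> Dict[str, Dict[str, int]]:
    """Count statuses per test case, as a per-test-case report would."""
    summary: Dict[str, Counter] = {}
    for record in results:
        status = record.get("status", "unevaluated")
        summary.setdefault(record["test_case"], Counter())[status] += 1
    return {case: dict(counts) for case, counts in summary.items()}
```

Passing the run, evaluate, and save steps in as callables keeps the loop testable without a live WebSocket connection.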

Research

Rasa CALM: Codebase Mapping & Architecture Analysis

In parallel with the evaluation work, I was tasked with understanding Rasa CALM deeply enough to inform whether Emoflow's architecture should be migrated toward it. This required reading the full open-source Rasa CALM codebase and producing a complete explanation of its architecture — how each component worked and how they connected.

The research produced a complete architectural analysis covering: the Dialogue Manager (FlowPolicy, processor, broker, channels), the Tracker (DialogueStateTracker — conversation history, slots, events), the Stack (DialogueStack — active flows, frame-based execution, ordered step sequencing), Dialogue Understanding (LLM Command Generator, Flow Retrieval, Command Processor), and Conversation Repair patterns.

A key finding was the fundamental architectural difference between Emoflow and CALM: Emoflow relied on the LLM to determine each step dynamically, which made behaviour unpredictable and hard to debug, while CALM constrained the LLM to selecting from a predefined command set (StartFlow, SetSlot, CancelFlow), with flow execution handled deterministically by FlowPolicy. This is the difference between an LLM driving the conversation and an LLM serving the conversation structure.
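The constrained-command idea can be illustrated with a toy interpreter. This is not Rasa's API: only the three command names come from the analysis above, while the `Name(arg, ...)` text format, `DialogueState`, and both functions are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Allowed commands and the number of arguments each one takes (assumed arities).
ARITY = {"StartFlow": 1, "SetSlot": 2, "CancelFlow": 0}


@dataclass
class DialogueState:
    stack: List[str] = field(default_factory=list)       # active flows, top is last
    slots: Dict[str, str] = field(default_factory=dict)  # collected values


def parse_command(text: str) -> Tuple[str, List[str]]:
    """Parse 'Name(arg, ...)' and reject anything outside the allowed set."""
    name, paren, rest = text.strip().partition("(")
    if name not in ARITY or not paren or not rest.endswith(")"):
        raise ValueError(f"command not allowed: {text!r}")
    args = [part.strip() for part in rest[:-1].split(",") if part.strip()]
    if len(args) != ARITY[name]:
        raise ValueError(f"{name} expects {ARITY[name]} argument(s): {text!r}")
    return name, args


def apply_command(state: DialogueState, text: str) -> DialogueState:
    """Apply one LLM-proposed command; state changes are deterministic."""
    name, args = parse_command(text)
    if name == "StartFlow":
        state.stack.append(args[0])
    elif name == "SetSlot":
        state.slots[args[0]] = args[1]
    elif state.stack:  # CancelFlow: drop the active flow, if any
        state.stack.pop()
    return state
```

The LLM's output is treated as a proposal that must parse into one of three commands; anything else is rejected before it can affect the conversation state, which is what makes failures reproducible and debuggable.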

The research produced a concrete four-week migration plan — retaining the LLM for intent classification and slot extraction while transferring flow control to CALM's structured stack — which the engineering team could evaluate and act on.

Scope

Languages & Voice Pipeline

The system operated across English and multiple Arabic dialects. For the TTS/STT pipeline, I sampled inputs and outputs from Emoflow to assess voice pipeline quality — testing how well the system handled dialect-specific speech patterns and flagging failures that didn't surface in text-based testing. Transcription-based test cases were also supported in the toolkit, allowing real voice pipeline logs to be fed directly into the evaluation framework.

Outcomes

Results & Impact

  • The evaluation framework kept pace as the engineering team expanded from 3 to 12+ conversational flows — providing a quality signal at each version iteration
  • Developers received structured, reproducible bug reports (JSON logs, Jira tickets, versioned conversation traces) — eliminating time spent on ad-hoc bug reproduction
  • Automated stakeholder routing ensured every finding reached the right person; weekly reporting gave the team a consistent quality signal across versions
  • Full architectural documentation of Rasa CALM (codebase structure, dialogue stack, tracker, command generation pipeline) delivered as a research artefact the engineering team did not previously have
  • CALM migration roadmap delivered as a concrete four-week implementation plan

Tech Stack

Python, OpenAI API, GPT-4o / GPT-4o-mini, LLM-as-a-Judge, WebSockets, Rasa CALM, LiteLLM, Prompt Engineering, Arabic NLP, TTS / STT, Jira, Confluence, Google Sheets API