QA automation lead specializing in testing conversational AI with AI — harnesses where an AI persona places live voice calls, judges them by listening to the audio, and closes the loop by rewriting the bot's own prompt. In eight months I took voice-bot QA from manual spot-checks to autonomous, audio-native, production-aware quality engineering across six internal AI products and multiple enterprise client engagements.
18+ Tools shipped
202 Tests green
6 Products QA'd
14 Models benched
250 Concurrent target
Autonomous Voice-Agent QA
Built an A/B + adversarial red-team harness for two production voice agents at a global card issuer (4 agents, 27 scenarios, a 5-item rubric plus two audio judges), catching currency and gender defects that transcript-only testing misses (47 tests).
Productized it into a cross-platform pip CLI (14 subcommands) whose autonomous accept/revert loop lifted a red-team resistance score from 67 to 100, auto-reverting regressions byte-for-byte (55 tests).
Audio-Native Evaluation
Reverse-engineered the only working audio-aware LLM-judge path on a third-party platform (after proving the documented flag was ignored); the dual-judge pattern caught 5 wrong grammatical-gender forms, rupees-as-dollars, and a 24-second barge-in latency — then shipped it as a CLI + Chrome extension.
Production Observability & Load
Architected a production-to-observability bridge streaming redacted live voice conversations from New Relic into continuous LLM scoring, advancing QA from pre-deploy simulation to real-traffic monitoring, with an in-process PII redactor and per-run cost guard (58 tests).
Designed a real-PSTN concurrency load test (ramp to 250 concurrent) for a multinational bank's collections dialer, diagnosing a launch-blocking carrier gap before it produced a false "passes at scale" result.
Performance, Reliability & Reporting
Built an LLM benchmark suite driving Azure AD B2C OAuth via Playwright to measure time-to-first-token and latency across up to 14 models, isolating platform overhead against direct-API baselines (~900 runs over 3 months).
Engineered a weekly prod-health rollup aggregating New Relic, Grafana Cloud, App Insights and Jira across six products into one red/amber/green (RAG) status slide, reducing 48,000 raw failure events to 41 real incidents.
Developer Enablement & AI Infrastructure
Built an agentic test-writing CLI (Node, Claude Agent SDK) letting QA testers author Playwright tests conversationally via corporate Microsoft Foundry, removing the need for individual API keys (26 tests).
Shipped a white-labeled Chrome extension for one-click evaluator runs (16 tests); root-caused a two-week LLM auth outage; validated a 1M-context model with accurate retrieval at 900k tokens.
Core Skills
Languages: Python · Node/TypeScript · JavaScript · Bash
Testing: Playwright · pytest · vitest · LLM-as-judge eval
AI / Agents: Claude Agent SDK · MCP · multimodal judges · prompt engineering
Voice AI: ElevenLabs · Twilio/PSTN · SIP · audio eval
Observability: New Relic · Grafana Cloud · App Insights · Teams alerting
Cloud / LLM: Azure AI Foundry · Anthropic · OpenAI · DeepSeek