Chris Hunter — Résumé (1-page) · AI/QA Automation Lead

Chris Hunter

AI / QA Automation Lead

Concentrix GPO · Conversational voice-agent QA · AI evaluation & observability · linkedin.com/in/chrisahunter

QA automation lead specializing in testing conversational AI with AI — harnesses where an AI persona places live voice calls, judges them by listening to the audio, and closes the loop by rewriting the bot's own prompt. In eight months I took voice-bot QA from manual spot-checks to autonomous, audio-native, production-aware quality engineering across six internal AI products and multiple enterprise client engagements.

18+ Tools shipped · 202 Tests green · 6 Products QA'd · 14 Models benched · 250 Concurrent target

Autonomous Voice-Agent QA

Built an A/B + adversarial red-team harness for two production voice agents at a global card issuer (4 agents, 27 scenarios, a 5-item rubric plus two audio judges), catching currency and gender defects that transcript-only testing misses. 47 tests
Productized it into a cross-platform, pip-installable CLI (14 subcommands) whose autonomous accept/revert loop lifted a red-team resistance score 67 → 100, auto-reverting regressions byte-for-byte. 55 tests

Audio-Native Evaluation

Reverse-engineered the only working audio-aware LLM-judge path on a third-party platform (after proving the documented flag was ignored); the dual-judge pattern caught 5 wrong grammatical-gender forms, rupees-as-dollars, and a 24-second barge-in latency — then shipped it as a CLI + Chrome extension.

Production Observability & Load

Architected a production-to-observability bridge that streams redacted live voice conversations from New Relic into continuous LLM scoring, with an in-process PII redactor and a per-run cost guard, advancing QA from pre-deploy simulation to real-traffic monitoring. 58 tests
Designed a real-PSTN concurrency load test (ramp to 250 concurrent) for a multinational bank's collections dialer, diagnosing a launch-blocking carrier gap before it produced a false "passes at scale" result.

Performance, Reliability & Reporting

Built an LLM benchmark suite driving Azure AD B2C OAuth via Playwright to measure time-to-first-token and latency across up to 14 models, isolating platform overhead against direct-API baselines (~900 runs over 3 months).
Engineered a weekly prod-health rollup aggregating New Relic, Grafana Cloud, App Insights and Jira across six products into one red/amber/green (RAG) status slide, cutting a raw count of 48,000 failures to 41 real incidents.

Developer Enablement & AI Infrastructure

Built an agentic test-writing CLI (Node, Claude Agent SDK) that lets QA testers author Playwright tests conversationally via corporate Microsoft Foundry, removing the need for individual API keys. 26 tests
Shipped a white-labeled Chrome extension for one-click evaluator runs (16 tests); root-caused a two-week LLM auth outage; validated a 1M-context model with accurate retrieval at 900k tokens.

Core Skills

Languages: Python · Node/TypeScript · JavaScript · Bash
Testing: Playwright · pytest · vitest · LLM-as-judge eval
AI / Agents: Claude Agent SDK · MCP · multimodal judges · prompt engineering
Voice AI: ElevenLabs · Twilio/PSTN · SIP · audio eval
Observability: New Relic · Grafana Cloud · App Insights · Teams alerting
Cloud / LLM: Azure AI Foundry · Anthropic · OpenAI · DeepSeek