AI / QA Automation Lead · Portfolio Deck
I build AI that tests AI.
Chris Hunter · Concentrix GPO
Conversational voice-agent QA · autonomous eval harnesses · production observability
The problem
Conversational AI shipped faster than anyone could test it.
Voice bots fail in ways transcripts never show: the wrong currency spoken aloud, the wrong grammatical gender, a two-second delay before the bot stops talking. Manual QA can't keep pace, and pre-release testing never sees real callers.
So I built the testers.
By the numbers
Eight months of shipping
250
Concurrent-call target
All figures re-verified in one audit — test suites run, source read, maturity graded.
What I do
Six capabilities, one throughline
01
Autonomous voice QA
An AI persona places live calls against two bot versions and A/B-scores each — with seeded red-team scenarios.
02
Audio-native evaluation
Judges that listen to the recording, catching what speech-to-text throws away.
03
Production observability
Live redacted prod conversations streamed into continuous LLM scoring.
04
Performance & load
Multi-model latency benchmarking and real-PSTN concurrency load tests.
05
Prod-health reporting
Four monitoring stacks, six products, one red/amber/green (RAG) slide for leadership.
06
Developer enablement
Agentic CLIs and extensions so QA and marketing self-serve.
Flagship pattern
The loop tests, judges, and fixes itself
AI caller
persona + scenario
→
Live call
real voice, recorded
→
Dual judge
audio + transcript
→
→
→
Accept / revert
margin gate
↺ retest — regressions auto-reverted byte-for-byte
Productized as ruby-voice-qa — a live run lifted a red-team resistance score from 67 → 100, auto-reverting every regression along the way.
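The accept / revert step can be pictured with a minimal sketch. Everything here is an assumption for illustration, not ruby-voice-qa's actual code: the `Version` type, the `margin_gate` and `run_loop` names, the 2-point margin, and the made-up scores. The idea shown is only that a candidate is kept when it beats the current version by a margin, and otherwise the current version is handed back untouched, so its bytes cannot drift.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    """Hypothetical record: one bot configuration plus its judged score."""
    config: bytes   # exact bytes of the configuration under test
    score: float    # judge score on a 0-100 scale (illustrative)


def margin_gate(baseline: Version, candidate: Version, margin: float = 2.0) -> Version:
    """Keep the candidate only if it beats the baseline by at least `margin`."""
    if candidate.score - baseline.score >= margin:
        return candidate
    # Revert: return the baseline object itself, so its bytes are unchanged.
    return baseline


def run_loop(start: Version, candidates: list[Version], margin: float = 2.0) -> Version:
    """Feed each retested candidate through the gate in order."""
    current = start
    for cand in candidates:
        current = margin_gate(current, cand, margin)
    return current


# Made-up scores, chosen only to exercise both branches of the gate.
start = Version(b"prompt v1", 67.0)
trials = [
    Version(b"prompt v2", 60.0),    # regression: reverted
    Version(b"prompt v3", 85.0),    # clear gain: accepted
    Version(b"prompt v4", 85.5),    # gain below the margin: reverted
    Version(b"prompt v5", 100.0),   # clear gain: accepted
]
final = run_loop(start, trials)
```

Requiring a margin rather than any improvement keeps judge noise from being mistaken for progress.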
Deep dive · novel contribution
The judge that listens
- Proved the platform's audio flag was silently ignored, then reverse-engineered the one working path.
- A dual-judge pattern: one hears the recording, one reads the transcript — verdicts merged.
- Packaged into a YAML CLI + zero-backend Chrome extension so non-devs deploy judges in ~30 seconds.
Defects caught — audio only
Rupees spoken as dollars: ✓ caught
Wrong gender forms (Hindi): 5 found
Barge-in stop latency: 23,930 ms
custom_code · multimodal · Chrome MV3
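One way to merge the two verdicts, as a rough sketch. The `Verdict` shape and the merge rule (fail if either judge fails, union of defects) are assumptions made for illustration; the deck does not state how the real judges combine their results.

```python
from dataclasses import dataclass, field


@dataclass
class Verdict:
    """Hypothetical judge output."""
    judge: str                      # "audio", "transcript", or "merged"
    passed: bool
    defects: list[str] = field(default_factory=list)


def merge_verdicts(audio: Verdict, transcript: Verdict) -> Verdict:
    """Assumed rule: the call passes only if both judges pass.

    Defects from both judges are combined in order, without duplicates,
    so an audio-only finding survives even when the transcript looks clean.
    """
    merged: list[str] = []
    for defect in audio.defects + transcript.defects:
        if defect not in merged:
            merged.append(defect)
    return Verdict("merged", audio.passed and transcript.passed, merged)


# Illustrative inputs: the transcript judge sees nothing wrong,
# the audio judge hears the wrong currency.
audio = Verdict("audio", False, ["currency spoken as dollars"])
transcript = Verdict("transcript", True, [])
result = merge_verdicts(audio, transcript)
```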
Deep dive · shift to production
Watching live traffic, not just tests
Grade A · 58 tests green
- Streams redacted live voice conversations from New Relic into an AI eval platform for continuous scoring.
- In-process PII redactor and a per-run cost guard — nothing sensitive leaves, spend is bounded.
- Advances QA from pre-deploy simulation to real-traffic monitoring.
ix-observe · live pipeline
Quality metrics scored: 7
Conversations backfilled: 50
Dashboard widgets: 4
Automated tests: 58 ✓
Python ×12 · NerdGraph · Teams alert
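The redactor and cost guard might look roughly like this. It is a simplified sketch under stated assumptions, not ix-observe's implementation: the two regex patterns, the placeholder labels, the `CostGuard` class and the dollar figures are all hypothetical, and a production redactor would need far broader coverage.

```python
import re

# Assumed patterns for illustration only; real PII coverage is much wider.
_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def redact(text: str) -> str:
    """Replace matches with a bracketed label before text leaves the process."""
    for label, pattern in _PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


class BudgetExceeded(RuntimeError):
    """Raised when a charge would push the run past its budget."""


class CostGuard:
    """Hypothetical per-run spend limiter."""

    def __init__(self, budget_usd: float) -> None:
        self.budget_usd = budget_usd
        self.spent_usd = 0.0

    def charge(self, cost_usd: float) -> None:
        """Record a cost, refusing it if the run budget would be exceeded."""
        if self.spent_usd + cost_usd > self.budget_usd:
            raise BudgetExceeded(f"budget of ${self.budget_usd:.2f} would be exceeded")
        self.spent_usd += cost_usd


clean = redact("Call me at +1 415 555 0100 or jo@example.com")
guard = CostGuard(budget_usd=1.0)
guard.charge(0.4)
guard.charge(0.4)   # a third 0.4 charge would raise BudgetExceeded
```

Checking the budget before each scoring call, rather than after, is what makes the spend bounded instead of merely reported.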
Deep dive · reliability at scale
Benchmarks and one honest health slide
- Perf suite: signs in through Playwright-driven OAuth, then measures time to first token (TTFT) and latency across up to 14 models, isolating platform overhead; ~900 runs over three months.
- Health rollup: four monitoring stacks + Jira across six products into one RAG slide; ran live during the audit.
- Defensible math: cut a 48,000 raw-failure count down to 41 real incidents.
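For the perf bullet, a minimal sketch of timing a streamed response. The function names, the injectable `clock` parameter and the idea of computing overhead as "platform TTFT minus direct-to-model TTFT" are assumptions for illustration; the deck does not describe how the suite isolates overhead.

```python
import time
from typing import Callable, Iterable, Optional


def measure_stream(chunks: Iterable[str],
                   clock: Callable[[], float] = time.perf_counter) -> dict:
    """Consume a token stream, recording time to first chunk and total time."""
    start = clock()
    ttft: Optional[float] = None
    parts: list[str] = []
    for chunk in chunks:
        if ttft is None:
            ttft = clock() - start   # first chunk has just arrived
        parts.append(chunk)
    total = clock() - start
    return {"ttft": ttft, "total": total, "text": "".join(parts)}


def platform_overhead(via_platform: dict, direct: dict) -> float:
    """Assumed definition: extra TTFT the platform adds over a direct model call."""
    return via_platform["ttft"] - direct["ttft"]


# Deterministic demo with a fake clock, so the numbers are not real measurements.
ticks = iter([0.0, 0.25, 1.0])
result = measure_stream(["Hel", "lo"], clock=lambda: next(ticks))
```

Passing the clock in keeps the measurement logic testable without real network calls.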
Health rollup · verified live
Products aggregated: 6
Auth models bridged: 4
Sources: NR · Grafana · AppIns · Jira
Ran end-to-end: ✓ real report
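One common way to turn a raw failure count into an incident count is to group events that share a product and error signature and arrive close together in time. The sketch below assumes that approach; the `collapse` function, the 10-minute gap and the sample events are all hypothetical and are not taken from the actual rollup.

```python
from datetime import datetime, timedelta


def collapse(events, gap=timedelta(minutes=10)):
    """Group (timestamp, product, signature) events into incidents.

    Consecutive events with the same product and signature belong to one
    incident as long as each arrives within `gap` of the previous one.
    """
    incidents = []
    open_incident = {}   # (product, signature) -> index of latest incident
    for ts, product, signature in sorted(events):
        key = (product, signature)
        idx = open_incident.get(key)
        if idx is not None and ts - incidents[idx]["end"] <= gap:
            incidents[idx]["end"] = ts
            incidents[idx]["count"] += 1
        else:
            incidents.append({"key": key, "start": ts, "end": ts, "count": 1})
            open_incident[key] = len(incidents) - 1
    return incidents


# Made-up events: a burst of timeouts, a second burst two hours later,
# and a short run of 502s on another product.
t0 = datetime(2024, 1, 1, 9, 0)
events = [(t0 + timedelta(minutes=i), "product-a", "timeout") for i in range(5)]
events.append((t0 + timedelta(hours=2), "product-a", "timeout"))
events += [
    (t0 + timedelta(minutes=1), "product-b", "HTTP 502"),
    (t0 + timedelta(minutes=3), "product-b", "HTTP 502"),
]
incidents = collapse(events)
```

Here eight raw failures reduce to three incidents, and each incident keeps its raw count so the reduction stays auditable.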
The arc
Tooling → autonomy in eight months
Nov–Jan
Foundations
QA + voice-infra utilities; Twilio & New Relic tooling.
Feb–Mar
First runner
Unified nightly voice-QA CLI across four envs.
Mar
Benchmarks
Multi-model perf suite launches.
Apr
Reliability
Auth outage fixed; extensions + health rollup.
May
Audio-native
Judges that listen; A/B + red-team harness.
Jun
Autonomy + live
Pip package, ix-observe, agentic CLI, 1M-context.
Trust the numbers
Nothing here is inflated
Six parallel agents re-audited every claim in one session — reading source, running suites, grading maturity honestly.
✓
Tests actually ran
202 green across five suites — not asserted, executed.
A–C
Honest grades
Docked for missing tests or VCS; prototypes labeled as such.
∅
Safe to share
Clients anonymized; credentials and prod data excluded.
In one line
The bots got smart.
So did the testing.
Chris Hunter · AI / QA Automation Lead · let's talk