AI / QA Automation Lead · Portfolio Deck

I build AI that tests AI.

Chris Hunter · Concentrix GPO
Conversational voice-agent QA · autonomous eval harnesses · production observability

The problem

Conversational AI shipped faster than anyone could test it.

Voice bots fail in ways transcripts never show — wrong currency spoken aloud, wrong grammatical gender, a two-second delay before it stops talking. Manual QA can't keep pace, and pre-release testing never sees real callers.

So I built the testers.

By the numbers

Eight months of shipping

18+ · Tools & systems
202 · Tests green
6 · AI products under QA
14 · Models benchmarked
250 · Concurrent-call target

All figures re-verified in one audit — test suites run, source read, maturity graded.

What I do

Six capabilities, one throughline

01

Autonomous voice QA

An AI persona places live calls against two bot versions and A/B-scores each — with seeded red-team scenarios.

02

Audio-native evaluation

Judges that listen to the recording, catching what speech-to-text throws away.

03

Production observability

Live redacted prod conversations streamed into continuous LLM scoring.

04

Performance & load

Multi-model latency benchmarking and real-PSTN concurrency load tests.

05

Prod-health reporting

Four monitoring stacks, six products, one red/amber/green (RAG) slide for leadership.

06

Developer enablement

Agentic CLIs and extensions so QA and marketing self-serve.

Flagship pattern

The loop tests, judges, and fixes itself

AI caller · persona + scenario
Live call · real voice, recorded
Dual judge · audio + transcript
Score · per-item rubric
Improve · LLM rewrite
Accept / revert · margin gate
↺ retest — regressions auto-reverted byte-for-byte

Productized as ruby-voice-qa — a live run lifted a red-team resistance score from 67 → 100, auto-reverting every regression along the way.
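
The accept / revert step can be sketched in a few lines of Python. This is a minimal illustration, not the ruby-voice-qa implementation: `improve_loop`, `score`, `rewrite`, `margin`, and `rounds` are hypothetical names, and the scoring and rewrite callables stand in for the live call, the judges, and the LLM rewrite.

```python
from typing import Callable, List, Tuple


def improve_loop(
    prompt: str,
    score: Callable[[str], float],
    rewrite: Callable[[str, int], str],
    margin: float = 1.0,
    rounds: int = 5,
) -> Tuple[str, float, List[str]]:
    """Keep a rewrite only if it beats the current best score by `margin`."""
    best, best_score = prompt, score(prompt)
    log: List[str] = []
    for i in range(rounds):
        candidate = rewrite(best, i)      # hypothetical LLM rewrite step
        cand_score = score(candidate)     # hypothetical retest + judge step
        if cand_score >= best_score + margin:
            best, best_score = candidate, cand_score
            log.append("accept")
        else:
            # Revert: `best` is never modified in place, so the previous
            # text is kept exactly as it was.
            log.append("revert")
    return best, best_score, log
```

Because the candidate is a separate string and the best version is only ever replaced, a revert is simply "do nothing", which is one way to get an exact restore without diffing.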

Deep dive · novel contribution

The judge that listens

  • Proved the platform's audio flag was silently ignored, then reverse-engineered the one working path.
  • A dual-judge pattern: one hears the recording, one reads the transcript — verdicts merged.
  • Packaged into a YAML CLI + zero-backend Chrome extension so non-devs deploy judges in ~30 seconds.
Defects caught — audio only
Rupees spoken as dollars · ✓ caught
Wrong gender forms (Hindi) · 5 found
Barge-in stop latency · 23,930 ms
custom_code · multimodal · Chrome MV3
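
One way the dual-judge merge could work is sketched below. The deck only says the verdicts are merged, so the rule here (an item fails if either judge fails it) is an assumption, and `Verdict` and `merge_verdicts` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Verdict:
    source: str               # "audio" or "transcript"
    items: Dict[str, bool]    # rubric item -> passed?


def merge_verdicts(audio: Verdict, transcript: Verdict) -> Dict[str, dict]:
    """Assumed merge rule: an item passes only if no judge failed it."""
    merged: Dict[str, dict] = {}
    for item in sorted(set(audio.items) | set(transcript.items)):
        failed_by: List[str] = [
            v.source for v in (audio, transcript) if v.items.get(item) is False
        ]
        merged[item] = {"passed": not failed_by, "failed_by": failed_by}
    return merged
```

Recording which judge failed an item keeps audio-only defects, such as a wrong currency spoken aloud, visible in the merged result.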
Deep dive · shift to production

Watching live traffic, not just tests

Grade A · 58 tests green
  • Streams redacted live voice conversations from New Relic into an AI eval platform for continuous scoring.
  • In-process PII redactor and a per-run cost guard — nothing sensitive leaves, spend is bounded.
  • Advances QA from pre-deploy simulation to real-traffic monitoring.
ix-observe · live pipeline
Quality metrics scored · 7
Conversations backfilled · 50
Dashboard widgets · 4
Automated tests · 58 ✓
Python ×12 · NerdGraph · Teams alert
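
The redactor and cost guard can be illustrated with a small sketch. This is not the ix-observe code: the regex patterns, the placeholder tokens, and the `CostGuard` class are assumptions chosen for illustration, and a production redactor would need far broader coverage.

```python
import re

# Illustrative patterns only; real PII coverage would be much wider.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Replace emails and phone-like numbers before text leaves the process."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


class BudgetExceeded(RuntimeError):
    """Raised when a run would spend more than its cap."""


class CostGuard:
    """Per-run spend cap: refuse a charge that would exceed the budget."""

    def __init__(self, max_usd: float) -> None:
        self.max_usd = max_usd
        self.spent_usd = 0.0

    def charge(self, usd: float) -> None:
        if self.spent_usd + usd > self.max_usd:
            raise BudgetExceeded(f"cap {self.max_usd} USD would be exceeded")
        self.spent_usd += usd
```

Calling `charge` before each scoring request, and `redact` before any text is sent out, keeps both checks in-process.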
Deep dive · reliability at scale

Benchmarks and one honest health slide

  • Perf suite: logs in through a Playwright-driven OAuth flow, then measures time to first token (TTFT) and total latency across up to 14 models, isolating platform overhead; ~900 runs over three months.
  • Health rollup: four monitoring stacks + Jira across six products into one RAG status slide; ran live during the audit.
  • Defensible math: reduced 48,000 raw failure events to 41 real incidents.
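
As a rough illustration of the TTFT measurement, the sketch below times a token stream. It assumes the model response arrives as an iterable of text chunks; `measure_stream` is a hypothetical helper, not part of the perf suite.

```python
import time
from typing import Iterable, List, Optional, Tuple


def measure_stream(tokens: Iterable[str]) -> Tuple[float, float, str]:
    """Return (time to first token, total latency, full text) in seconds."""
    start = time.perf_counter()
    ttft: Optional[float] = None
    parts: List[str] = []
    for tok in tokens:
        if ttft is None:
            # First chunk received: record time to first token.
            ttft = time.perf_counter() - start
        parts.append(tok)
    total = time.perf_counter() - start
    # An empty stream has no first token; fall back to total latency.
    return (ttft if ttft is not None else total), total, "".join(parts)
```

Running the same measurement against the platform and directly against the model would give two numbers whose difference approximates platform overhead.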
Health rollup · verified live
Products aggregated · 6
Auth models bridged · 4
Sources · NR · Grafana · AppIns · Jira
Ran end-to-end · ✓ real report
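
The deck does not say how raw failures were reduced to incidents. One plausible approach, shown here purely as an assumption, is to group failure events by product and error signature and treat events that arrive close together as a single incident. `count_incidents` and the five-minute gap are hypothetical.

```python
from typing import Dict, Iterable, Tuple

# (product, error signature, unix timestamp in seconds)
Event = Tuple[str, str, float]


def count_incidents(events: Iterable[Event], gap_s: float = 300.0) -> int:
    """Count bursts: same product + signature within `gap_s` is one incident."""
    last_seen: Dict[Tuple[str, str], float] = {}
    incidents = 0
    for product, signature, ts in sorted(events, key=lambda e: e[2]):
        key = (product, signature)
        if key not in last_seen or ts - last_seen[key] > gap_s:
            incidents += 1  # new burst for this product/signature pair
        last_seen[key] = ts
    return incidents
```

With a rule like this, a retry storm of thousands of identical failures counts once, which is the kind of arithmetic a leadership slide can defend.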
The arc

Tooling → autonomy in eight months

Nov–Jan

Foundations

QA + voice-infra utilities; Twilio & New Relic tooling.

Feb–Mar

First runner

Unified nightly voice-QA CLI across four envs.

Mar

Benchmarks

Multi-model perf suite launches.

Apr

Reliability

Auth outage fixed; extensions + health rollup.

May

Audio-native

Judges that listen; A/B + red-team harness.

Jun

Autonomy + live

Pip package, ix-observe, agentic CLI, 1M-context.

Trust the numbers

Nothing here is inflated

Six parallel agents re-audited every claim in one session — reading source, running suites, grading maturity honestly.

Tests actually ran

202 green across five suites — not asserted, executed.

A–C

Honest grades

Docked for missing tests or VCS; prototypes labeled as such.

Safe to share

Clients anonymized; credentials and prod data excluded.

In one line

The bots got smart.
So did the testing.

Chris Hunter · AI / QA Automation Lead · let's talk

chris_hunter.qa