AI / QA Automation Lead · Portfolio Deck

I build AI that
tests AI.

Chris Hunter · Concentrix GPO
Conversational voice-agent QA · autonomous eval harnesses · production observability

The problem

Conversational AI shipped faster
than anyone could test it.

Voice bots fail in ways transcripts never show — wrong currency spoken aloud, wrong grammatical gender, a two-second delay before it stops talking. Manual QA can't keep pace, and pre-release testing never sees real callers.

So I built the testers.

By the numbers

Eight months of shipping

18+

Tools & systems

202

Tests green

6

AI products under QA

14

Models benchmarked

250

Concurrent-call target

All figures re-verified in one audit — test suites run, source read, maturity graded.

What I do

Six capabilities, one throughline

01

Autonomous voice QA

An AI persona places live calls against two bot versions and A/B-scores each — with seeded red-team scenarios.

02

Audio-native evaluation

Judges that listen to the recording, catching what speech-to-text throws away.

03

Production observability

Live redacted prod conversations streamed into continuous LLM scoring.

04

Performance & load

Multi-model latency benchmarking and real-PSTN concurrency load tests.

05

Prod-health reporting

Four monitoring stacks, six products, one red/amber/green (RAG) status slide for leadership.

06

Developer enablement

Agentic CLIs and extensions so QA and marketing self-serve.

Flagship pattern

The loop tests, judges, and fixes itself

AI caller

persona + scenario

→

Live call

real voice, recorded

→

Dual judge

audio + transcript

→

Score

per-item rubric

→

Improve

LLM rewrite

→

Accept / revert

margin gate

↺ retest — regressions auto-reverted byte-for-byte

Productized as ruby-voice-qa — a live run lifted a red-team resistance score from 67 → 100, auto-reverting every regression along the way.
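
The accept / revert step can be sketched as below. This is a minimal illustration, not the ruby-voice-qa code: `improve_with_gate`, `score`, `rewrite`, and the default `margin` are hypothetical names and values. The idea is that a rewrite is kept only when it beats the current best score by the margin. Otherwise the previous prompt string is left untouched, so the rollback is exact.

```python
from typing import Callable


def improve_with_gate(
    prompt: str,
    score: Callable[[str], float],    # hypothetical: runs the call + judges, returns a score
    rewrite: Callable[[str], str],    # hypothetical: LLM rewrite of the current prompt
    margin: float = 2.0,              # assumed gate size, not taken from the deck
    rounds: int = 3,
) -> tuple[str, float]:
    """Keep a rewrite only if it beats the current best score by `margin`."""
    best_prompt, best_score = prompt, score(prompt)
    for _ in range(rounds):
        candidate = rewrite(best_prompt)
        candidate_score = score(candidate)
        if candidate_score >= best_score + margin:
            # Accept: the candidate becomes the new baseline for the next retest.
            best_prompt, best_score = candidate, candidate_score
        # Otherwise revert: best_prompt was never modified, so nothing to undo.
    return best_prompt, best_score
```

Because the baseline is never mutated in place, a rejected rewrite cannot leak partial changes into the next round.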

Deep dive · novel contribution

The judge that listens

Proved the platform's audio flag was silently ignored, then reverse-engineered the one working path.
A dual-judge pattern: one hears the recording, one reads the transcript — verdicts merged.
Packaged into a YAML CLI + zero-backend Chrome extension so non-devs deploy judges in ~30 seconds.
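
One way the dual-judge merge could look, as a sketch only. The deck says the verdicts are merged but not how, so the rule here is an assumption: the call passes only if both judges pass, and defects from either judge are combined. `Verdict` and `merge_verdicts` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Verdict:
    passed: bool
    defects: list[str] = field(default_factory=list)


def merge_verdicts(audio: Verdict, transcript: Verdict) -> Verdict:
    """Assumed rule: pass only if both judges pass; defects are unioned."""
    # dict.fromkeys removes duplicates while keeping first-seen order.
    merged = list(dict.fromkeys(audio.defects + transcript.defects))
    return Verdict(passed=audio.passed and transcript.passed, defects=merged)
```

Under this rule an audio-only defect, such as a currency spoken incorrectly, fails the call even when the transcript judge sees nothing wrong.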

Defects caught — audio only

Rupees spoken as dollars: ✓ caught

Wrong gender forms (Hindi): 5 found

Barge-in stop latency: 23,930 ms

custom_code · multimodal · Chrome MV3

Deep dive · shift to production

Watching live traffic, not just tests

Grade A · 58 tests green

Streams redacted live voice conversations from New Relic into an AI eval platform for continuous scoring.
In-process PII redactor and a per-run cost guard — nothing sensitive leaves, spend is bounded.
Advances QA from pre-deploy simulation to real-traffic monitoring.
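
A minimal sketch of the two safeguards named above, assuming a regex-based redactor and a simple running budget. The patterns, the `redact` function, and the `CostGuard` class are illustrative assumptions, not the ix-observe implementation. A production redactor would cover many more PII types.

```python
import re

# Hypothetical patterns for illustration only.
_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def redact(text: str) -> str:
    """Replace each matched span with a bracketed label, in process."""
    for label, pattern in _PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


class CostGuard:
    """Refuses any charge that would push the run past its budget."""

    def __init__(self, budget_usd: float) -> None:
        self.budget_usd = budget_usd
        self.spent_usd = 0.0

    def charge(self, cost_usd: float) -> None:
        if self.spent_usd + cost_usd > self.budget_usd:
            raise RuntimeError("per-run budget exceeded")
        self.spent_usd += cost_usd
```

In this sketch a conversation would pass through `redact` before leaving the process, and `charge` would be called before each scoring request so an over-budget run stops rather than overspends.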

ix-observe · live pipeline

Quality metrics scored: 7

Conversations backfilled: 50

Dashboard widgets: 4

Automated tests: 58 ✓

Python ×12 · NerdGraph · Teams alert

Deep dive · reliability at scale

Benchmarks and one honest health slide

Perf suite: Playwright-driven OAuth measures TTFT + latency across up to 14 models, isolating platform overhead — ~900 runs over three months.
Health rollup: four monitoring stacks + Jira across six products into one RAG slide; ran live during the audit.
Defensible math: cut a 48,000 raw-failure count down to 41 real incidents.
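
The deck does not say how raw failures were reduced to incidents. One common approach, sketched here as an assumption, is to treat repeated failures of the same check within a short gap as a single incident. `count_incidents` and the 300-second `gap_s` default are hypothetical.

```python
from collections import defaultdict


def count_incidents(failures: list[tuple[str, float]], gap_s: float = 300.0) -> int:
    """failures holds (check_name, unix_timestamp) pairs.

    Failures of the same check that arrive within gap_s of the previous
    failure are counted as part of the same incident.
    """
    by_check: dict[str, list[float]] = defaultdict(list)
    for name, ts in failures:
        by_check[name].append(ts)

    incidents = 0
    for stamps in by_check.values():
        stamps.sort()
        last = None
        for ts in stamps:
            # A new incident starts on the first failure or after a quiet gap.
            if last is None or ts - last > gap_s:
                incidents += 1
            last = ts
    return incidents
```

A flapping check that fails every minute for an hour produces sixty raw failures but one incident under this rule, which is the kind of collapse that makes a headline count defensible.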

Health rollup · verified live

Products aggregated: 6

Auth models bridged: 4

Sources: NR · Grafana · AppIns · Jira

Ran end-to-end: ✓ real report

The arc

Tooling → autonomy in eight months

Nov–Jan

Foundations

QA + voice-infra utilities; Twilio & New Relic tooling.

Feb–Mar

First runner

Unified nightly voice-QA CLI across four envs.

Mar

Benchmarks

Multi-model perf suite launches.

Apr

Reliability

Auth outage fixed; extensions + health rollup.

May

Audio-native

Judges that listen; A/B + red-team harness.

Jun

Autonomy + live

Pip package, ix-observe, agentic CLI, 1M-context.

Trust the numbers

Nothing here is inflated

Six parallel agents re-audited every claim in one session — reading source, running suites, grading maturity honestly.

✓

Tests actually ran

202 green across five suites — not asserted, executed.

A–C

Honest grades

Docked for missing tests or VCS; prototypes labeled as such.

∅

Safe to share

Clients anonymized; credentials and prod data excluded.

In one line

The bots got smart.
So did the testing.

Chris Hunter · AI / QA Automation Lead · let's talk