Chris Hunter — AI / QA Automation Lead

/02 What I do

Six capabilities, one throughline

Every project below is real, version-controlled where noted, and verified this session. Grades reflect engineering maturity — test coverage, versioning, and whether it runs today.

Autonomous voice-agent QA

Harnesses where an AI persona places live calls against two bot versions, scores each call on a rubric, and compares the versions A/B, including seeded adversarial red-team scenarios.

Audio-native evaluation

LLM judges that listen to the call recording, catching currency mis-rendering, grammatical-gender errors, and barge-in latency that transcript-only testing throws away.

Production observability

Bridges that pull live redacted conversations from prod logs into automated LLM scoring — moving QA from pre-deploy simulation to real-traffic monitoring.

Performance & load engineering

Multi-model latency benchmarking that isolates platform overhead, plus real-PSTN concurrency load tests ramping toward 250 simultaneous calls.

Prod-health reporting

A weekly rollup aggregating four monitoring stacks across six products into one RAG slide for leadership — with defensible, per-source outage math.

Developer enablement

Agentic CLIs and extensions that let QA and marketing self-serve — writing Playwright tests by chat, running evaluators in one click, no personal API keys.

/03 Selected work

The build log

Fourteen highlighted systems, grouped by capability. Grade badges A→C reflect maturity; the check tells you how it was verified in this audit.

Autonomous voice-agent QA

Voice A/B + red-team harness

Dual-Agent A/B Harness

A/B harness pitting stable vs proposed versions of two production voice bots for a global card issuer, scoring each call on a 5-item rubric plus two audio judges, with adversarial scenarios seeded from a real failed demo.

4 agents · 27 scenarios · 14 metrics · 8 dated runs

✓ VERIFIED — 47 tests pass offline

Python ~7k LOC · Cekura REST · LLM judge · pytest ×47

Nightly A/B + improvement loop

"Ruby" Self-Improving Harness

A−

An AI persona places side-by-side nightly calls against two prompt revisions of a collections bot, scores on a rubric, then feeds failures into a Claude prompt-improvement loop. The template every later client harness was built from.

12 tracked runs · best B=10/10 vs A=8/10 · 15 commits

✓ VERIFIED — scripts compile, versioned

Python ~2.6k LOC · Azure Anthropic · ElevenLabs V3 · Playwright

Distributable pip package

ruby-voice-qa

A−

The bespoke harness productized into a cross-platform CLI (ruby-qa, 14 subcommands). Its autopilot closes an autonomous evaluate → improve → redeploy → retest → accept/revert loop with a margin gate and byte-for-byte auto-revert.

Live arc: 67 → 100 red-team score, regressions auto-reverted

✓ VERIFIED — 55 tests pass in 0.11s

Python pkg · httpx · pytest ×55 · console_scripts

Unified test-runner CLI

cekura-assistant

Packaged CLI + nightly runner driving a voice-eval platform via a hand-rolled MCP client across four bot environments — producing the documented catalog of recurring defects that anchored all later A/B work.

4 envs · ~42 scenarios · 5 agents · nightly ×10

◐ PARTIAL — imports clean, no unit tests

Python 3.11 · MCP · launchd cron · Rich UI

Audio-native evaluation

Reference architecture

The Audio Judge

A−

Reverse-engineered the only working audio-aware LLM-judge path on a third-party platform — after proving the documented flag was silently ignored. A dual-judge pattern that hears what speech-to-text destroys.

Caught: 5 wrong gender forms · rupees-as-dollars · 23,930 ms barge-in

◆ VERIFIED — design doc + live catches

custom_code metrics · multimodal judge · Claude Sonnet

Manifest-driven CLI + extension

audio_judge toolkit

B+

Turned that hard-won insight into a YAML-driven CLI and a zero-backend Chrome extension, cutting audio-metric setup from a manual API session to a ~30-second edit — so non-developers can deploy judges themselves.

2 shipped judges (INR, gender) · byte-identical CLI/UI parity

◐ PARTIAL — parses clean, live-only paths

Python CLI · Chrome MV3 · YAML manifests

Production observability & load

Prod → observability bridge

ix-observe

Streams redacted live voice-bot conversations from New Relic into an AI eval platform for continuous LLM scoring — advancing QA from pre-deploy simulation to real-traffic monitoring. In-process PII redactor; per-run cost guard.

7-metric pipeline · 50 conversations backfilled · 4-widget dashboard live

✓ VERIFIED — 58 tests pass in 0.19s

Python ×12 modules · NerdGraph · pytest ×58 · Teams alerts

Real-PSTN concurrency load test

Voicebot Load Harness

B+

A capacity load test ramping real carrier calls (1→250 concurrent) against a multinational bank's collections dialer, with a deterministic self-terminating flow that isolates load without flooding a live-agent queue.

Ramp to 250 · diagnosed India-termination blocker before launch

◆ STAGED — harness built, exec gated on carrier

Cekura load · PSTN/SIP · load metrics

Performance, reliability & reporting

Multi-model LLM benchmark suite

iXHello Perf Suite

Drives Azure AD B2C OAuth via Playwright to measure time-to-first-token and full-response latency across up to 14 GPT/Claude models on an internal platform, isolating platform overhead against direct-API baselines.

14 models · ~900 runs over 3 months · self-refreshing dashboard

✓ VERIFIED: modules import, ~900 run files

Python 3.13 · Playwright OAuth · SSE timing · heartbeat daemon
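
A minimal sketch of the timing idea behind the suite, not the suite's actual code: measure time-to-first-token and full-response latency over any stream of text chunks, then subtract a direct-API baseline to estimate platform overhead. The names `time_stream` and `platform_overhead` and the injectable `clock` are assumptions made for illustration; the stream here is a stand-in for an SSE response.

```python
import time
from typing import Callable, Iterable, Optional


def time_stream(chunks: Iterable[str], clock: Callable[[], float] = time.perf_counter) -> dict:
    """Consume a chunk stream and record TTFT and total latency in seconds."""
    start = clock()
    first: Optional[float] = None
    count = 0
    for chunk in chunks:
        # The first non-empty chunk marks time-to-first-token
        if first is None and chunk:
            first = clock()
        count += 1
    end = clock()
    return {
        "ttft_s": None if first is None else first - start,
        "total_s": end - start,
        "chunks": count,
    }


def platform_overhead(platform_s: float, direct_s: float) -> float:
    """Overhead attributed to the platform: platform timing minus direct-API timing."""
    return platform_s - direct_s


def fake_stream(delay_s: float):
    """Hypothetical stand-in for a streamed model response."""
    time.sleep(delay_s)
    yield "Hel"
    time.sleep(delay_s)
    yield "lo"


via_platform = time_stream(fake_stream(0.05))
direct = time_stream(fake_stream(0.02))
print(platform_overhead(via_platform["ttft_s"], direct["ttft_s"]))
```

Injecting the clock keeps the measurement logic testable without real sleeps or network calls.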

Multi-source prod-health rollup

QA Health RAG Rollup

Aggregates live telemetry from New Relic, Grafana Cloud, and Azure App Insights plus Jira across six products into one RAG slide — with a reverse-engineered Grafana-SSO workaround and defensible per-source outage rules.

6 products · 4 auth models · cut 48k raw failures → 41 real incidents

✓ VERIFIED LIVE — ran & produced a real report

Python stdlib · NRQL/PromQL/KQL · Playwright-SSO · Jira REST
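
One plausible shape for an outage rule, sketched under assumptions: collapse raw failure timestamps into incidents by merging failures that fall within a fixed gap of each other. The actual per-source rules are not described here, so the `gap_s` threshold and the `collapse_incidents` name are hypothetical.

```python
def collapse_incidents(failure_times: list[float], gap_s: float = 300.0) -> list[tuple[float, float]]:
    """Group failure timestamps (seconds) into (start, end) incidents.

    A failure joins the current incident when it occurs within `gap_s`
    of that incident's most recent failure; otherwise it opens a new one.
    """
    incidents: list[tuple[float, float]] = []
    for t in sorted(failure_times):
        if incidents and t - incidents[-1][1] <= gap_s:
            # Extend the open incident to cover this failure
            incidents[-1] = (incidents[-1][0], t)
        else:
            incidents.append((t, t))
    return incidents


raw = [0.0, 60.0, 120.0, 1000.0, 1100.0, 5000.0]
print(collapse_incidents(raw))  # [(0.0, 120.0), (1000.0, 1100.0), (5000.0, 5000.0)]
```

Six raw failures become three incidents here; the same kind of grouping, with rules tuned per source, is how a large raw failure count can reduce to a short incident list.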

Developer enablement & AI infra

Agentic test-writing CLI

ixqa

A Node CLI wrapping the Claude Agent SDK so QA testers write Playwright tests conversationally — using corporate Microsoft Foundry Claude instead of personal keys. Three-tier credential resolver, headless tool-gating, ixqa doctor live-probe.

Proven end-to-end · removes the need for individual API keys

✓ VERIFIED — 26 tests pass

Node/TS · Agent SDK · Foundry auth · vitest ×26

White-labeled Chrome extension

Collections Evaluator

B+

A one-click MV3 extension letting marketing run voice-AI evaluators against bots — no backend, direct API, fully white-labeled (zero vendor strings, test-enforced). Includes a reCAPTCHA-bypass setup flow.

5 evaluators × 3 envs · white-labeling enforced by tests

✓ VERIFIED — 16 Playwright tests pass

Vanilla JS MV3 · Tailwind · Playwright ×16

Release-readiness reporting

Test Coverage Report

B+

A zero-dependency Python generator that cross-references Jira items with Xray test executions into a self-contained HTML release-readiness report — reused via a Claude Code slash command.

Stdlib-only · shipped reports for 2 releases

✓ VERIFIED — CLI renders, real artifacts

Python stdlib · Jira REST · Xray GraphQL

Cloud AI enablement

Azure Foundry & DeepSeek

A−

Root-caused a two-week LLM auth outage (per-route header mismatch, six hypotheses eliminated), then deployed and validated a 1M-context model — proving accurate retrieval at 900k tokens and pricing out the build-vs-buy case.

Outage fixed in 1 line · 1M context verified · 9-model catalog audited

◆ DOCUMENTED — dated end-to-end verifies

Azure AI Foundry · Anthropic route · DeepSeek-V4

/04 How it works

The systems, drawn

Four architectures that recur across the work — the closed test loop, the listening judge, the production bridge, and the health rollup.

01 · The autonomous test loop

An AI persona tests the bot, a dual judge scores it, an LLM rewrites the prompt — and only keeps changes that measurably win.

AI caller: persona + scenario

→

Live call: real voice, recorded

→

Dual judge: audio + transcript

→

Rubric score: per-item verdict

→

Improve prompt: LLM rewrite

→

Accept / revert: margin gate

↺ retest — regressions auto-reverted byte-for-byte
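
The accept/revert step can be sketched as a margin gate. This is an illustration under assumptions, not the package's code: `Candidate`, `accept_or_revert`, the scores, and the default `margin` are all hypothetical. The point is that a proposed prompt is kept only when it beats the baseline by a clear margin, and a revert hands back the untouched baseline text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    prompt: str   # full prompt text under test
    score: float  # rubric score from the retest (scale is illustrative)


def accept_or_revert(baseline: Candidate, proposed: Candidate, margin: float = 5.0) -> Candidate:
    """Keep the proposed prompt only if it beats the baseline by at least `margin`."""
    if proposed.score - baseline.score >= margin:
        return proposed
    # Returning the same baseline object means the reverted prompt is
    # the exact original text, never a re-generated copy
    return baseline


baseline = Candidate(prompt="v1 prompt", score=70.0)
winner = accept_or_revert(baseline, Candidate(prompt="v2 prompt", score=85.0))
loser = accept_or_revert(baseline, Candidate(prompt="v3 prompt", score=72.0))
print(winner.prompt)  # v2 prompt
print(loser.prompt)   # v1 prompt
```

The margin guards against accepting changes whose gain is within run-to-run scoring noise.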

02 · The listening judge

One recording, two judges, merged — so failures that live only in the audio can't slip past the transcript.

Call recording: WAV via platform

→

Audio judge: currency · gender · prosody · barge-in

Transcript judge: workflow · hallucination

→

Merge: reconcile verdicts

→

Verdict: scored + evidenced
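
A rough sketch of the merge step, assuming a simple rule that a check fails when any judge fails it. The `Verdict` fields and the `merge_verdicts` name are hypothetical; the real reconciliation logic may weigh judges differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    judge: str     # "audio" or "transcript"
    check: str     # e.g. "currency", "workflow"
    passed: bool
    evidence: str  # short quote or timestamp backing the verdict


def merge_verdicts(verdicts: list[Verdict]) -> dict[str, dict]:
    """Merge per-judge verdicts; a check passes only if every judge passed it."""
    merged: dict[str, dict] = {}
    for v in verdicts:
        entry = merged.setdefault(v.check, {"passed": True, "evidence": []})
        entry["passed"] = entry["passed"] and v.passed
        if not v.passed:
            # Keep evidence only for failures so the report stays short
            entry["evidence"].append(f"{v.judge}: {v.evidence}")
    return merged


merged = merge_verdicts([
    Verdict("transcript", "currency", True, "amount text matches the account"),
    Verdict("audio", "currency", False, "amount spoken as dollars, expected rupees"),
    Verdict("transcript", "workflow", True, "all required steps present"),
])
print(merged["currency"]["passed"])  # False
print(merged["workflow"]["passed"])  # True
```

The currency case shows why both judges are needed: the transcript looks correct, and only the audio judge catches the spoken error.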

03 · The production bridge (ix-observe)

Real traffic becomes scored quality signal — redacted in-process, deduped, and cost-guarded before it ever leaves.

Prod voice calls: live traffic

→

New Relic logs: NerdGraph pull

→

PII redactor: in-process

→

Map + dedupe: state store

→

LLM scoring: 7 metrics

→

Dashboard + alert: Teams spike
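
The middle of the bridge (redact, dedupe, cost guard) can be sketched as below. Everything here is an assumption for illustration: the two regex patterns are far narrower than a production redactor, the state store is a plain set, and the cost guard is modeled as a simple per-run cap.

```python
import re

# Illustrative patterns only; a real redactor would cover many more PII types
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
LONG_DIGITS = re.compile(r"\b\d{6,}\b")  # card-, account-, or phone-like digit runs


def redact(text: str) -> str:
    """Mask emails and long digit runs before text leaves the process."""
    text = EMAIL.sub("[EMAIL]", text)
    return LONG_DIGITS.sub("[NUMBER]", text)


def select_for_scoring(conversations: list[dict], seen_ids: set[str], max_per_run: int) -> list[dict]:
    """Return redacted, not-yet-scored conversations, capped at `max_per_run`."""
    batch: list[dict] = []
    for convo in conversations:
        if len(batch) >= max_per_run:
            break  # cost guard: stop once the per-run budget is used
        if convo["id"] in seen_ids:
            continue  # dedupe: already sent for scoring in an earlier run
        seen_ids.add(convo["id"])
        batch.append({"id": convo["id"], "text": redact(convo["text"])})
    return batch


seen: set[str] = {"c1"}
calls = [
    {"id": "c1", "text": "already scored"},
    {"id": "c2", "text": "my card is 4111111111111111"},
    {"id": "c3", "text": "email me at jo@example.com"},
    {"id": "c4", "text": "over the cap"},
]
batch = select_for_scoring(calls, seen, max_per_run=2)
print([c["text"] for c in batch])  # ['my card is [NUMBER]', 'email me at [EMAIL]']
```

Redaction runs inside `select_for_scoring`, so nothing unredacted can reach the scoring call.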

04 · The health rollup

Four heterogeneous monitoring sources, four auth models, one RAG slide leadership can read in ten seconds.

New Relic: NRQL

Grafana Cloud: PromQL · SSO cookie

App Insights: KQL · az token

Jira: escape-rate signals

→

Normalize: uniform row contract

→

RAG matrix: 6 products × 4 metrics
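
A small sketch of the normalize-then-grade idea, with hypothetical names and thresholds: each source adapter emits the same `Row` shape, and one function turns rows into a red/amber/green matrix. The availability cut-offs below are placeholders, not the rollup's real rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    """Uniform row every source adapter is assumed to emit."""
    product: str
    metric: str
    value: float  # here: availability percentage
    source: str   # e.g. "newrelic", "grafana", "appinsights", "jira"


def rag_status(value: float, green_at: float = 99.9, amber_at: float = 99.0) -> str:
    """Map a percentage to GREEN / AMBER / RED using placeholder thresholds."""
    if value >= green_at:
        return "GREEN"
    if value >= amber_at:
        return "AMBER"
    return "RED"


def build_matrix(rows: list[Row]) -> dict[str, dict[str, str]]:
    """Build a product -> metric -> status matrix from normalized rows."""
    matrix: dict[str, dict[str, str]] = {}
    for row in rows:
        matrix.setdefault(row.product, {})[row.metric] = rag_status(row.value)
    return matrix


rows = [
    Row("product-a", "availability", 99.95, "newrelic"),
    Row("product-b", "availability", 99.2, "grafana"),
    Row("product-c", "availability", 97.0, "appinsights"),
]
print(build_matrix(rows))
```

Because grading happens after normalization, adding a source only requires a new adapter that emits `Row` objects.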

/05 The arc

Eight months, tooling → autonomy

A real progression: from one-off QA utilities, to nightly A/B harnesses, to productized packages, to watching live production.

Nov 2025 – Jan 2026

QA automation foundations

Xray test-step tooling and voice-infra utilities — including a 7-tool Twilio number/subaccount CLI to provision isolated test lines, and a React + Claude tool that answers New Relic questions in plain English.

Feb – Mar 2026

First voice-QA runner

A Dockerized prototype refactored into cekura-assistant — a unified CLI driving a voice-eval platform across four environments on a nightly schedule, producing the first catalog of recurring bot defects.

Mar 2026

Multi-model performance benchmarking

The iXHello perf suite launches: Playwright-driven OAuth, per-model TTFT/latency, direct-API baselines to isolate platform overhead. ~900 runs accumulate over the next three months.

Apr 2026

Reliability & enablement

Root-caused a two-week Azure Foundry auth outage. Shipped the white-labeled Collections Evaluator extension and the Jira+Xray coverage report, and scaffolded the multi-source QA health rollup — including a Grafana-SSO cookie workaround.

May 2026

Audio-native breakthrough

Proved the platform's audio-judge flag was silently ignored, then reverse-engineered the working path. Shipped judges that listen — catching currency and Hindi-gender defects — across the EMI suite and the dual-agent A/B + red-team harness.

Jun 2026

Autonomy & live production

Productized the harness into the ruby-voice-qa pip package (autonomous accept/revert loop). Shipped ix-observe — live prod conversations into continuous scoring. Built ixqa (agentic test-writing CLI), validated a 1M-context model, and staged a 250-concurrent load test.

I build AI that tests AI.

How this portfolio was verified

Tests actually run

Live where possible

Honest grading

Anonymized & safe