AI / QA Automation Lead · Portfolio

I build AI that
tests AI.

Conversational voice-agent QA · autonomous eval harnesses · production observability

Eight months turning manual voice-bot testing into autonomous, audio-native, production-aware quality engineering — harnesses where an AI persona places live calls, judges by listening to the recording, and rewrites the bot's own prompt to fix what it finds.

18+
Tools & systems shipped
202
Automated tests green
6
AI products under QA
14
LLM models benchmarked
250
Concurrent-call load target
/01 The thesis

Conversational AI shipped faster than anyone could test it. So I built the testers.

Systems where an AI caller runs the scenarios, a judge listens to the audio — not just the transcript — and the loop closes itself: score, improve, redeploy, retest. The same discipline now watches live production traffic, not just pre-release simulations.

/02 What I do

Six capabilities, one throughline

Every project below is real, version-controlled where noted, and verified in the single audit session described in /06. Grades reflect engineering maturity: test coverage, versioning, and whether it runs today.

01

Autonomous voice-agent QA

Harnesses where an AI persona places live calls against two bot versions, scores each call on a rubric, and compares the two A/B, including seeded adversarial red-team scenarios.

02

Audio-native evaluation

LLM judges that listen to the call recording, catching currency mis-rendering, grammatical-gender errors, and barge-in latency that transcript-only testing throws away.

03

Production observability

Bridges that pull live redacted conversations from prod logs into automated LLM scoring — moving QA from pre-deploy simulation to real-traffic monitoring.

04

Performance & load engineering

Multi-model latency benchmarking that isolates platform overhead, plus real-PSTN concurrency load tests ramping toward 250 simultaneous calls.

05

Prod-health reporting

A weekly rollup aggregating four monitoring stacks across six products into one RAG slide for leadership — with defensible, per-source outage math.

06

Developer enablement

Agentic CLIs and extensions that let QA and marketing self-serve — writing Playwright tests by chat, running evaluators in one click, no personal API keys.

/03 Selected work

The build log

Fourteen highlighted systems, grouped by capability. Grade badges A→C reflect maturity; the status line on each card (VERIFIED, PARTIAL, STAGED, or DOCUMENTED) tells you how it was verified in this audit.

Autonomous voice-agent QA
Voice A/B + red-team harness

Dual-Agent A/B Harness

A

A/B harness pitting stable vs proposed versions of two production voice bots for a global card issuer, scoring each call on a 5-item rubric plus two audio judges, with adversarial scenarios seeded from a real failed demo.

4 agents · 27 scenarios · 14 metrics · 8 dated runs
VERIFIED — 47 tests pass offline
Python ~7k LOC · Cekura REST · LLM judge · pytest ×47
Nightly A/B + improvement loop

"Ruby" Self-Improving Harness

A−

An AI persona places side-by-side nightly calls against two prompt revisions of a collections bot, scores on a rubric, then feeds failures into a Claude prompt-improvement loop. The template every later client harness was built from.

12 tracked runs · best B=10/10 vs A=8/10 · 15 commits
VERIFIED — scripts compile, versioned
Python ~2.6k LOC · Azure Anthropic · ElevenLabs V3 · Playwright
Distributable pip package

ruby-voice-qa

A−

The bespoke harness productized into a cross-platform CLI (ruby-qa, 14 subcommands). Its autopilot closes an autonomous evaluate → improve → redeploy → retest → accept/revert loop with a margin gate and byte-for-byte auto-revert.

Live arc: 67 → 100 red-team score, regressions auto-reverted
VERIFIED — 55 tests pass in 0.11s
Python pkg · httpx · pytest ×55 · console_scripts
Unified test-runner CLI

cekura-assistant

B

Packaged CLI + nightly runner driving a voice-eval platform via a hand-rolled MCP client across four bot environments — producing the documented catalog of recurring defects that anchored all later A/B work.

4 envs · ~42 scenarios · 5 agents · nightly ×10
PARTIAL — imports clean, no unit tests
Python 3.11 · MCP · launchd cron · Rich UI
Audio-native evaluation
Reference architecture

The Audio Judge

A−

Reverse-engineered the only working audio-aware LLM-judge path on a third-party platform — after proving the documented flag was silently ignored. A dual-judge pattern that hears what speech-to-text destroys.

Caught: 5 wrong gender forms · rupees-as-dollars · 23,930 ms barge-in
VERIFIED — design doc + live catches
custom_code metrics · multimodal judge · Claude Sonnet
Manifest-driven CLI + extension

audio_judge toolkit

B+

Turned that hard-won insight into a YAML-driven CLI and a zero-backend Chrome extension, cutting audio-metric setup from a manual API session to a ~30-second edit — so non-developers can deploy judges themselves.

2 shipped judges (INR, gender) · byte-identical CLI/UI parity
PARTIAL — parses clean, live-only paths
Python CLI · Chrome MV3 · YAML manifests
Production observability & load
Prod → observability bridge

ix-observe

A

Streams redacted live voice-bot conversations from New Relic into an AI eval platform for continuous LLM scoring — advancing QA from pre-deploy simulation to real-traffic monitoring. In-process PII redactor; per-run cost guard.

7-metric pipeline · 50 conversations backfilled · 4-widget dashboard live
VERIFIED — 58 tests pass in 0.19s
Python ×12 modules · NerdGraph · pytest ×58 · Teams alerts
Real-PSTN concurrency load test

Voicebot Load Harness

B+

A capacity load test ramping real carrier calls (1→250 concurrent) against a multinational bank's collections dialer, with a deterministic self-terminating flow that isolates load without flooding a live-agent queue.

Ramp to 250 · diagnosed India-termination blocker before launch
STAGED — harness built, exec gated on carrier
Cekura load · PSTN/SIP · load metrics
Performance, reliability & reporting
Multi-model LLM benchmark suite

iXHello Perf Suite

A

Drives Azure AD B2C OAuth via Playwright to measure time-to-first-token and full-response latency across up to 14 GPT/Claude models on an internal platform, isolating platform overhead against direct-API baselines.

14 models · ~900 runs over 3 months · self-refreshing dashboard
VERIFIED — modules import, 900 run files
Python 3.13 · Playwright OAuth · SSE timing · heartbeat daemon
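One way to read "isolating platform overhead" is as simple arithmetic: the latency measured through the platform minus the latency of the same model called directly. The sketch below assumes medians are compared and that samples are in milliseconds; the function name and the sample values are hypothetical and are not taken from the suite.

```python
from statistics import median


def platform_overhead_ms(platform_ms: list[float], direct_ms: list[float]) -> float:
    """Median latency through the platform minus median direct-API latency.

    Assumption: both lists hold time-to-first-token samples for the same
    model and prompt set, so the difference approximates platform overhead.
    """
    return median(platform_ms) - median(direct_ms)


# Made-up samples for illustration only.
overhead = platform_overhead_ms([820.0, 900.0, 870.0], [610.0, 650.0, 640.0])
```

Medians are used here because a single slow run would otherwise dominate a small sample; a percentile-based comparison would be an equally reasonable choice.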
Multi-source prod-health rollup

QA Health RAG Rollup

A

Aggregates live telemetry from New Relic, Grafana Cloud, and Azure App Insights plus Jira across six products into one RAG slide — with a reverse-engineered Grafana-SSO workaround and defensible per-source outage rules.

6 products · 4 auth models · cut 48k raw failures → 41 real incidents
VERIFIED LIVE — ran & produced a real report
Python stdlib · NRQL/PromQL/KQL · Playwright-SSO · Jira REST
Developer enablement & AI infra
Agentic test-writing CLI

ixqa

A

A Node CLI wrapping the Claude Agent SDK so QA testers write Playwright tests conversationally — using corporate Microsoft Foundry Claude instead of personal keys. Three-tier credential resolver, headless tool-gating, ixqa doctor live-probe.

Proven end-to-end · kills need for individual API keys
VERIFIED — 26 tests pass
Node/TS · Agent SDK · Foundry auth · vitest ×26
White-labeled Chrome extension

Collections Evaluator

B+

A one-click MV3 extension letting marketing run voice-AI evaluators against bots — no backend, direct API, fully white-labeled (zero vendor strings, test-enforced). Includes a reCAPTCHA-bypass setup flow.

5 evaluators × 3 envs · white-labeling enforced by tests
VERIFIED — 16 Playwright tests pass
Vanilla JS MV3 · Tailwind · Playwright ×16
Release-readiness reporting

Test Coverage Report

B+

A zero-dependency Python generator that cross-references Jira items with Xray test executions into a self-contained HTML release-readiness report — reused via a Claude Code slash command.

Stdlib-only · shipped reports for 2 releases
VERIFIED — CLI renders, real artifacts
Python stdlib · Jira REST · Xray GraphQL
Cloud AI enablement

Azure Foundry & DeepSeek

A−

Root-caused a two-week LLM auth outage (per-route header mismatch, six hypotheses eliminated), then deployed and validated a 1M-context model — proving accurate retrieval at 900k tokens and pricing out the build-vs-buy case.

Outage fixed in 1 line · 1M context verified · 9-model catalog audited
DOCUMENTED — dated end-to-end verifies
Azure AI Foundry · Anthropic route · DeepSeek-V4
/04 How it works

The systems, drawn

Four architectures that recur across the work — the closed test loop, the listening judge, the production bridge, and the health rollup.

01 · The autonomous test loop

An AI persona tests the bot, a dual judge scores it, an LLM rewrites the prompt — and only keeps changes that measurably win.

AI caller: persona + scenario
Live call: real voice, recorded
Dual judge: audio + transcript
Rubric score: per-item verdict
Improve prompt: LLM rewrite
Accept / revert: margin gate
↺ retest — regressions auto-reverted byte-for-byte
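The accept/revert step of this loop can be sketched as a small gate: a rewritten prompt is kept only if it beats the current best score by a margin, otherwise the stored baseline stays in place. This is a minimal illustration, not the harness's actual code. `evaluate` and `improve` are hypothetical callables standing in for the live-call scoring and the LLM rewrite, and the default margin and round count are assumptions.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class LoopResult:
    prompt: str    # prompt left deployed when the loop ends
    score: float   # score of that prompt
    accepted: int  # candidates that cleared the margin gate
    reverted: int  # candidates that were rolled back


def run_loop(
    prompt: str,
    evaluate: Callable[[str], float],      # hypothetical: run scenarios, return a score
    improve: Callable[[str, float], str],  # hypothetical: LLM rewrite of the prompt
    margin: float = 5.0,                   # assumed gate width
    rounds: int = 3,
) -> LoopResult:
    best_prompt, best_score = prompt, evaluate(prompt)
    accepted = reverted = 0
    for _ in range(rounds):
        candidate = improve(best_prompt, best_score)
        candidate_score = evaluate(candidate)
        if candidate_score >= best_score + margin:
            # Measurable win: the candidate becomes the new baseline.
            best_prompt, best_score = candidate, candidate_score
            accepted += 1
        else:
            # Revert: the stored baseline string is never modified,
            # so restoring it is exact.
            reverted += 1
    return LoopResult(best_prompt, best_score, accepted, reverted)
```

Keeping the baseline as an untouched string, rather than trying to undo an edit, is what makes an exact revert trivial in this sketch.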
02 · The listening judge

One recording, two judges, merged — so failures that live only in the audio can't slip past the transcript.

Call recording: WAV via platform
Audio judge: currency · gender · prosody · barge-in
Transcript judge: workflow · hallucination
Merge: reconcile verdicts
Verdict: scored + evidenced
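The merge step can be illustrated with a conservative rule: a check passes only if every judge that reported on it passed it, and evidence from both judges is carried through. The rule, the `Verdict` type, and the check names in the usage below are assumptions made for this sketch; the page only states that the two verdicts are reconciled.

```python
from dataclasses import dataclass, field


@dataclass
class Verdict:
    passed: bool
    evidence: list[str] = field(default_factory=list)


def merge_verdicts(
    audio: dict[str, Verdict], transcript: dict[str, Verdict]
) -> dict[str, Verdict]:
    """Combine per-check verdicts from two judges.

    Assumed rule: a check fails if either judge fails it, so an
    audio-only failure cannot be masked by a clean transcript.
    """
    merged: dict[str, Verdict] = {}
    for name in sorted(audio.keys() | transcript.keys()):
        found = [v for v in (audio.get(name), transcript.get(name)) if v is not None]
        merged[name] = Verdict(
            passed=all(v.passed for v in found),
            evidence=[e for v in found for e in v.evidence],
        )
    return merged


# Hypothetical verdicts for illustration only.
audio_verdicts = {"currency": Verdict(False, ["amount spoken in the wrong currency"])}
transcript_verdicts = {"currency": Verdict(True), "workflow": Verdict(True)}
final = merge_verdicts(audio_verdicts, transcript_verdicts)
```

Here `final["currency"]` fails even though the transcript judge passed it, which is the behaviour the dual-judge pattern is after.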
03 · The production bridge (ix-observe)

Real traffic becomes scored quality signal — redacted in-process, deduped, and cost-guarded before it ever leaves.

Prod voice calls: live traffic
New Relic logs: NerdGraph pull
PII redactor: in-process
Map + dedupe: state store
LLM scoring: 7 metrics
Dashboard + alert: Teams spike
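The middle of this pipeline (redact, dedupe, cost guard) can be sketched in a few lines. Everything here is illustrative: the two regex patterns are far narrower than a real PII redactor would need, the cost guard is modelled as a simple per-run cap on conversations, and the `id`/`text` record shape is an assumption rather than the ix-observe schema.

```python
import re
from typing import Iterable

# Illustrative patterns only; real redaction needs much broader coverage.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
LONG_DIGITS = re.compile(r"\d{6,}")


def redact(text: str) -> str:
    """Mask emails and long digit runs before text leaves the process."""
    text = EMAIL.sub("[EMAIL]", text)
    return LONG_DIGITS.sub("[NUMBER]", text)


def select_for_scoring(
    conversations: Iterable[dict], seen: set[str], max_per_run: int
) -> list[dict]:
    """Return redacted, not-yet-scored conversations, capped per run.

    `seen` plays the role of the state store: IDs already sent for
    scoring are skipped, and newly selected IDs are added to it.
    """
    batch: list[dict] = []
    for conv in conversations:
        if len(batch) >= max_per_run:  # cost guard: stop once the cap is hit
            break
        if conv["id"] in seen:         # dedupe against earlier runs
            continue
        seen.add(conv["id"])
        batch.append({"id": conv["id"], "text": redact(conv["text"])})
    return batch
```

Redacting inside `select_for_scoring` means nothing unredacted is ever placed in the outgoing batch, which matches the "in-process" ordering in the diagram.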
04 · The health rollup

Four heterogeneous monitoring sources, four auth models, one RAG slide leadership can read in ten seconds.

New Relic: NRQL
Grafana Cloud: PromQL · SSO cookie
App Insights: KQL · az token
Jira: escape-rate signals
Normalize: uniform row contract
RAG matrix: 6 products × 4 metrics
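The "uniform row contract" idea can be shown as a single row type that every source adapter emits, followed by a threshold lookup that turns each row into a red/amber/green cell. The field names, metric names, and thresholds below are hypothetical; the real rollup's per-source outage rules are not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    product: str
    metric: str
    value: float
    source: str  # which monitoring stack produced the value


def rag(value: float, amber_at: float, red_at: float) -> str:
    """Map a 'higher is worse' value to a RAG status."""
    if value >= red_at:
        return "RED"
    if value >= amber_at:
        return "AMBER"
    return "GREEN"


def build_matrix(
    rows: list[Row], thresholds: dict[str, tuple[float, float]]
) -> dict[str, dict[str, str]]:
    """Build a product x metric matrix of RAG statuses from normalized rows."""
    matrix: dict[str, dict[str, str]] = {}
    for row in rows:
        amber_at, red_at = thresholds[row.metric]
        matrix.setdefault(row.product, {})[row.metric] = rag(row.value, amber_at, red_at)
    return matrix
```

Because every adapter returns the same `Row` shape, adding a fifth source would only need a new adapter, with no change to the matrix logic.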
/05 The arc

Eight months, tooling → autonomy

A real progression: from one-off QA utilities, to nightly A/B harnesses, to productized packages, to watching live production.

Nov 2025 – Jan 2026

QA automation foundations

Xray test-step tooling and voice-infra utilities — including a 7-tool Twilio number/subaccount CLI to provision isolated test lines, and a React + Claude tool that answers New Relic questions in plain English.

Feb – Mar 2026

First voice-QA runner

A Dockerized prototype refactored into cekura-assistant — a unified CLI driving a voice-eval platform across four environments on a nightly schedule, producing the first catalog of recurring bot defects.

Mar 2026

Multi-model performance benchmarking

The iXHello perf suite launches: Playwright-driven OAuth, per-model TTFT/latency, direct-API baselines to isolate platform overhead. ~900 runs accumulate over the next three months.

Apr 2026

Reliability & enablement

Root-caused a two-week Azure Foundry auth outage. Shipped the white-labeled Collections Evaluator extension and the Jira+Xray coverage report, and scaffolded the multi-source QA health rollup — including a Grafana-SSO cookie workaround.

May 2026

Audio-native breakthrough

Proved the platform's audio-judge flag was silently ignored, then reverse-engineered the working path. Shipped judges that listen — catching currency and Hindi-gender defects — across the EMI suite and the dual-agent A/B + red-team harness.

Jun 2026

Autonomy & live production

Productized the harness into the ruby-voice-qa pip package (autonomous accept/revert loop). Shipped ix-observe — live prod conversations into continuous scoring. Built ixqa (agentic test-writing CLI), validated a 1M-context model, and staged a 250-concurrent load test.

/06 Trust the numbers

How this portfolio was verified

Every claim here was re-checked in a single audit session by six parallel agents — reading source, running test suites, and grading maturity. No live production calls; no client data reproduced.

Tests actually run

202 tests executed green across 5 suites — A/B harness 47, ruby-voice-qa 55, ix-observe 58, ixqa 26, collections 16.

Live where possible

The QA health rollup was run end-to-end during the audit and produced a real week's report.

A–C

Honest grading

Grades dock for missing tests or version control. Prototypes and blocked work are labeled as such, not inflated.

Anonymized & safe

Client names generalized; credentials and prod telemetry excluded before anything was shared.