Latest Writing
Articles
Does the inference platform matter?
I deployed Whisper Large v3 on four inference platforms and got up to 67 percentage points of WER divergence on the same audio file. Same model, same input, different output.
Evaluating speech-to-text models for Indian banking
How do you evaluate a model that has no system prompt? I tested three ASR providers on code-mixed banking conversations and found that my measurement was more broken than the models.
What I learned building a speech-to-text app from scratch
Why do dictated words just 'appear' and why pay for Wispr Flow when open-source models exist? I built a local STT app to find out.
Revisiting the questions AI asked me: An ode to the AskUserQuestion tool
The Q&A exchanges with Claude are the best part of my AI sessions, so I built a tool to resurface them.
Keeping context fresh for PM workflows
How I leverage Claude Code with Claude in Chrome to keep PM context fresh and automated across recurring data workflows.
Agent Teams for Product Managers
Can AI agents that argue with each other help a PM stress-test a product hypothesis? I tested Anthropic's Agent Teams feature to find out.
Side Projects
Projects
ASR Evaluation Exploration
An evaluation framework for speech-to-text models and inference platforms, tested on code-mixed Indian banking audio across 7 providers and 4 deployment platforms.
Vox
A native macOS speech-to-text menu bar app, built to understand what makes great dictation software great.
Claude QA Viewer
A zero-dependency tool that extracts AskUserQuestion interactions from Claude Code sessions and generates an interactive HTML visualization.
Support signal
A Python tool that automates Zendesk ticket analysis using LLMs, turning weeks of manual triage into a 2-hour automated run.
Featured Work
Case Studies
Diagnostics - Helping customers ship with confidence
Building a self-service troubleshooting tool that reduced L1 support tickets by 35% and serves 700+ customers.
35%
Ticket reduction
Prototype to Production: Evals for AI reliability
From prompt to rule: building a 4-dimension LLM-as-judge framework that improved accuracy from 45% to 85%.
85%
Accuracy