What it is
An evaluation framework for speech-to-text models, built to answer two questions:
- Which ASR provider handles code-mixed Indian banking audio best?
- Does the same model produce different output when served by different inference platforms?
I tested 3 managed ASR providers (Sarvam AI, ElevenLabs, OpenAI) and deployed Whisper Large v3 on 4 inference platforms (Baseten, Together AI, Groq, Fireworks AI). All were evaluated against the same 28 audio files covering Hindi, English, Kannada, and code-mixed speech.
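The core metric behind these comparisons is word error rate. As a point of reference, here is a minimal, self-contained WER implementation (word-level Levenshtein distance); the actual pipeline's scoring code may differ:

```python
# Word error rate via dynamic-programming edit distance over
# whitespace-separated tokens. Illustrative sketch, not the
# project's actual scoring code.

def wer(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# One substitution in a five-word reference -> 20% WER.
print(wer("please check my account balance",
          "please check my bank balance"))  # -> 0.2
```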
Key findings
- No single "best" model. Sarvam leads on overall WER (15%). OpenAI wins on Hinglish code-mixed speech (9%). ElevenLabs scores 100% on banking entity accuracy.
- Script mismatch inflates WER by 40 percentage points. The single biggest discovery: different models write the same English loanwords in different scripts. Without normalizing for this, evaluation results are misleading.
- Model quality beats post-processing. Upgrading from whisper-1 to gpt-4o-transcribe matched the improvement of an entire LLM correction pipeline, at zero cost.
- Same model, different platform, different output. WER on the same audio file diverged by up to 67 percentage points across inference platforms. Providers agree on easy audio and diverge on hard audio.
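To make the script-normalization point concrete: a hedged sketch of how mixed-script transcripts can be mapped to one canonical form before scoring. The lookup table here is purely illustrative; the real mapping would cover the banking-domain loanwords actually observed in the corpus.

```python
# Map Devanagari spellings of English loanwords to a single Latin form,
# so two models that chose different scripts for the same word are not
# penalized by WER. The table below is a hypothetical example.

LOANWORD_MAP = {
    "बैंक": "bank",        # "bank" in Devanagari
    "अकाउंट": "account",   # "account" in Devanagari
    "बैलेंस": "balance",   # "balance" in Devanagari
}

def normalize_script(transcript: str) -> str:
    """Replace known Devanagari loanword spellings with their Latin forms."""
    return " ".join(LOANWORD_MAP.get(tok, tok) for tok in transcript.split())

# Two hypotheses that say the same thing in different scripts:
hyp_a = "mera बैंक अकाउंट check karo"
hyp_b = "mera bank account check karo"
print(normalize_script(hyp_a) == normalize_script(hyp_b))  # -> True
```

Without this step, every loanword rendered in the "other" script counts as a full substitution error, which is how a 40-point WER gap can appear between transcripts that a bilingual reader would call identical.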
What I wrote about it
- Evaluating speech-to-text models for Indian banking: the evaluation methodology, normalization discovery, and correction experiment
- Does the inference platform matter?: deploying Whisper across 4 platforms, failure mode analysis, and the Baseten deployment experience
- Interactive code walkthrough: annotated pipeline architecture