What it is
An evaluation framework for speech-to-text models, built to answer two questions:
- Which ASR provider handles code-mixed Indian banking audio best?
- Does the same model produce different output when served by different inference platforms?
I tested 3 managed ASR providers (Sarvam AI, ElevenLabs, OpenAI) and deployed Whisper Large v3 on 4 inference platforms (Baseten, Together AI, Groq, Fireworks AI). All were evaluated against the same 28 audio files covering Hindi, English, Kannada, and code-mixed speech.
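The core metric behind these comparisons is word error rate. As a point of reference, here is a minimal, self-contained WER implementation (word-level Levenshtein distance); the actual pipeline's scoring code may differ:

```python
# Word error rate via dynamic-programming edit distance over
# whitespace-separated tokens. Illustrative sketch, not the
# project's actual scoring code.

def wer(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# One substitution in a five-word reference -> 20% WER.
print(wer("please check my account balance",
          "please check my bank balance"))  # -> 0.2
```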
Key findings
- No single "best" model. Sarvam leads on overall WER (15%). OpenAI wins on Hinglish code-mixed speech (9%). ElevenLabs scores 100% on banking entity accuracy.
- Script mismatch inflates WER by 40 percentage points. The single biggest discovery: different models write the same English loanwords in different scripts. Without normalizing for this, evaluation results are misleading.
- Model quality beats post-processing. Upgrading from whisper-1 to gpt-4o-transcribe matched the improvement of an entire LLM correction pipeline, at zero cost.
- Same model, different platform, different output. WER on the same audio file diverged by up to 67 percentage points across inference platforms. Providers agree on easy audio and diverge on hard audio.
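To make the script-normalization point concrete: a hedged sketch of how mixed-script transcripts can be mapped to one canonical form before scoring. The lookup table here is purely illustrative; the real mapping would cover the banking-domain loanwords actually observed in the corpus.

```python
# Map Devanagari spellings of English loanwords to a single Latin form,
# so two models that chose different scripts for the same word are not
# penalized by WER. The table below is a hypothetical example.

LOANWORD_MAP = {
    "बैंक": "bank",        # "bank" in Devanagari
    "अकाउंट": "account",   # "account" in Devanagari
    "बैलेंस": "balance",   # "balance" in Devanagari
}

def normalize_script(transcript: str) -> str:
    """Replace known Devanagari loanword spellings with their Latin forms."""
    return " ".join(LOANWORD_MAP.get(tok, tok) for tok in transcript.split())

# Two hypotheses that say the same thing in different scripts:
hyp_a = "mera बैंक अकाउंट check karo"
hyp_b = "mera bank account check karo"
print(normalize_script(hyp_a) == normalize_script(hyp_b))  # -> True
```

Without this step, every loanword rendered in the "other" script counts as a full substitution error, which is how a 40-point WER gap can appear between transcripts that a bilingual reader would call identical.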
What I wrote about it
- Evaluating speech-to-text models for Indian banking: the evaluation methodology, normalization discovery, and correction experiment
- Does the inference platform matter?: deploying Whisper across 4 platforms, failure mode analysis, and the Baseten deployment experience
- Interactive code walkthrough: annotated pipeline architecture