March 22, 2026

1 min read

ASR Evaluation Exploration

View on GitHub

What it is

An evaluation framework for speech-to-text models, built to answer two questions:

  1. Which ASR provider handles code-mixed Indian banking audio best?
  2. Does the same model produce different output when served by different inference platforms?

I tested three managed ASR providers (Sarvam AI, ElevenLabs, OpenAI) and deployed Whisper Large v3 on four inference platforms (Baseten, Together AI, Groq, Fireworks AI), all evaluated against 28 audio files covering Hindi, English, Kannada, and code-mixed speech.
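The evaluations boil down to scoring each transcript against a reference with word error rate. A minimal sketch of that metric in plain Python (word-level edit distance; the actual harness in the repo may use a library such as jiwer instead):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words, using a rolling DP row.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            cur = min(d[j] + 1,            # deletion
                      d[j - 1] + 1,        # insertion
                      prev + (r != h))     # substitution (free if words match)
            prev, d[j] = d[j], cur
    return d[len(hyp)] / max(len(ref), 1)
```

For example, `wer("the cat sat", "the dog sat")` is one substitution over three reference words, i.e. about 33%.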

Key findings

  • No "best" model. Sarvam leads overall WER (15%). OpenAI wins Hinglish code-mixed (9%). ElevenLabs scores 100% on banking entity accuracy.
  • Skipping script normalization inflates WER by 40 percentage points. The single biggest discovery: different models write the same English loanwords in different scripts. Without normalizing for this, evaluation results are misleading.
  • Model quality beats post-processing. Upgrading from whisper-1 to gpt-4o-transcribe matched the improvement of an entire LLM correction pipeline, at zero cost.
  • Same model, different platform, different output. Up to 67 percentage points of WER divergence on the same audio file across inference platforms. Platforms agree on easy audio and diverge on hard audio.
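The script-normalization finding can be illustrated with a toy pre-processing step. The loanword table and function name below are hypothetical, not the repo's actual mapping; a real pipeline would need a full lexicon or a transliteration pass:

```python
# Illustrative only: a tiny table mapping Devanagari spellings of English
# banking loanwords to one canonical Latin form before WER is computed.
LOANWORDS = {
    "बैंक": "bank",
    "अकाउंट": "account",
}

def normalize_script(transcript: str) -> str:
    """Map code-mixed tokens to a canonical script so identical words match."""
    return " ".join(LOANWORDS.get(tok, tok).lower() for tok in transcript.split())

# Without this step, "बैंक अकाउंट" vs "bank account" scores 100% WER even
# though both transcripts are correct; after normalization they are identical.
```

This is the mechanism behind the 40-percentage-point inflation: the raw strings differ in every token, so the error is entirely orthographic, not acoustic.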

What I wrote about it