OpenAI released Whisper in September 2022, and it immediately changed the speech recognition landscape. For the first time, an open-source model matched or exceeded commercial alternatives like Google Speech-to-Text and Amazon Transcribe on real-world audio. In 2026, Whisper and Whisper-derived models power the majority of consumer and developer dictation tools. Here's how it performs.
Whisper performance at a glance (2026)
WER = Word Error Rate. Lower is better. LibriSpeech is the standard read-speech benchmark. Real-world conversational accuracy varies by speaker, environment, and content.
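To make the metric concrete, here is a minimal, self-contained sketch of the standard WER calculation: the word-level Levenshtein distance (substitutions, insertions, deletions) divided by the number of reference words. The example sentences are illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Word-level Levenshtein distance via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[-1][-1] / len(ref)

# One substitution ("jumps" -> "jumped") and one deletion ("the"):
# 2 errors over 8 reference words = 25% WER.
print(word_error_rate("the quick brown fox jumps over the dog",
                      "the quick brown fox jumped over dog"))  # 0.25
```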
Whisper model sizes compared
| Model | WER (LibriSpeech clean) | Time per clip (CPU, M2 Pro) | Use case |
|---|---|---|---|
| tiny | ~9.8% | ~0.3s | Real-time on-device, low accuracy |
| base | ~7.4% | ~0.6s | On-device balance |
| small | ~5.1% | ~1.4s | Good quality on-device |
| medium | ~3.6% | ~4.2s | High quality, slow on CPU |
| large-v3 | ~2.7% | ~12s | Best accuracy; requires GPU/LPU for speed |
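Trying the smaller sizes locally takes a few lines with the open-source `openai-whisper` package. A minimal sketch, assuming the package is installed and `memo.wav` is a placeholder for your own recording:

```python
import whisper  # pip install openai-whisper

# Pick a size from the table above; "small" is a common on-device sweet spot.
model = whisper.load_model("small")
result = model.transcribe("memo.wav", language="en")
print(result["text"])
```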
AiType uses the large model on Groq's LPU hardware, which is how it achieves ~250ms latency at near-maximum accuracy. Running the large model on a CPU takes 10–15 seconds per clip, which is why on-device large-Whisper dictation tools don't exist in practice.
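Calling Whisper large-v3 on Groq directly looks like this with Groq's Python SDK; a minimal sketch, assuming a `GROQ_API_KEY` environment variable is set and `memo.wav` is a placeholder file:

```python
from groq import Groq  # pip install groq

client = Groq()  # reads GROQ_API_KEY from the environment
with open("memo.wav", "rb") as f:
    transcription = client.audio.transcriptions.create(
        file=("memo.wav", f.read()),
        model="whisper-large-v3",
    )
print(transcription.text)
```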
Where Whisper is very accurate
- Standard American English, British English, Australian English. Sub-3% WER in quiet conditions.
- Technical vocabulary. Whisper's training corpus is large and diverse enough that terms like "Kubernetes," "TypeScript," "tokenization," and "indemnification" usually transcribe correctly without a custom vocabulary.
- European languages. Spanish, French, German, Italian, Portuguese — all excellent, typically 3–5% WER.
- Clear audio. A quality microphone in a quiet room gets the best accuracy numbers.
Where Whisper struggles
- Heavy accents. Strong regional accents (Scottish, South African, Indian English in noisy conditions) can push WER to 10–15%.
- Noisy environments. Open offices, cafés, and outdoor use reduce accuracy significantly. Signal-to-noise ratio matters.
- Proper nouns and brand names. Unusual names, company names, and place names often get mangled: "Salesforce" is fine, "Zühlke" or "Twilio" less so. One mitigation, seeding the decoder with expected spellings, is sketched after this list.
- Fast, choppy speech. Whisper models do best with natural, flowing speech, and very short clips (one or two words) have noticeably higher error rates.
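For the proper-noun problem, the open-source package's `transcribe` call accepts an `initial_prompt` that biases the decoder toward given spellings. A minimal sketch, with placeholder file name and vocabulary:

```python
import whisper

model = whisper.load_model("small")
# Names listed in initial_prompt nudge the decoder toward those spellings.
result = model.transcribe(
    "standup.wav",
    initial_prompt="Zühlke, Twilio, Kubernetes, TypeScript",
)
print(result["text"])
```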
Whisper vs Google Speech-to-Text vs Azure (2026)
| Provider | WER (conversational) | Latency | Cost |
|---|---|---|---|
| Whisper large-v3 (Groq) | ~4.2% | ~250ms | ~$0.003/min |
| Google Speech-to-Text v2 | ~4.8% | ~300–500ms | ~$0.016/min |
| Azure Speech (fast) | ~5.1% | ~400ms | ~$0.015/min |
| Amazon Transcribe | ~5.3% | ~500ms | ~$0.024/min |
| Apple Dictation (on-device) | ~3.5% | Instant | Free |
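Those per-minute prices compound quickly at dictation volumes. A quick back-of-the-envelope comparison, assuming an illustrative two hours of audio a day over 22 working days:

```python
minutes = 2 * 60 * 22  # 2,640 minutes of audio per month
for name, per_min in [("Whisper large-v3 (Groq)", 0.003),
                      ("Google Speech-to-Text v2", 0.016),
                      ("Amazon Transcribe", 0.024)]:
    print(f"{name}: ${minutes * per_min:.2f}/month")
# Whisper large-v3 (Groq): $7.92/month
# Google Speech-to-Text v2: $42.24/month
# Amazon Transcribe: $63.36/month
```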
What AI cleanup adds on top of Whisper
Raw Whisper accuracy is typically 95–98% at the word level, but that's not the same as "usable text." You still get:
- Unreliable punctuation (Whisper's punctuation is inconsistent, especially on short dictation clips)
- Filler words ("um," "uh," "like," "you know")
- Run-on sentences without paragraph breaks
- Inconsistent capitalisation
- Verbatim phrasing that reads awkwardly
AiType's AI cleanup pass addresses all of these. The result isn't a more accurate transcription — it's better text from the same transcription.
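AiType's own pipeline isn't public, but a cleanup pass of this kind is straightforward to sketch: send the raw transcript through a fast instruction-tuned model with a rewrite prompt. A minimal sketch using Groq's chat API; the model name and prompt wording here are illustrative assumptions, not AiType's actual implementation:

```python
from groq import Groq

client = Groq()

def clean_up(raw_transcript: str) -> str:
    """Rewrite a raw transcript: punctuation, fillers, paragraphs, casing."""
    response = client.chat.completions.create(
        model="llama-3.1-8b-instant",  # any fast instruction-tuned model works
        messages=[
            {"role": "system", "content": (
                "Clean up this dictation transcript. Add punctuation and "
                "paragraph breaks, remove filler words, and fix "
                "capitalisation. Do not change the meaning or add content.")},
            {"role": "user", "content": raw_transcript},
        ],
    )
    return response.choices[0].message.content

print(clean_up("um so i think we should uh ship the beta on friday you know"))
```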
Bottom line on Whisper accuracy
Whisper large-v3 on Groq hardware offers the best price-to-performance for English dictation in 2026: ~2.7% WER on clean audio, ~4% conversational, and fast enough (~250ms) to replace typing in real time. The limitation isn't accuracy; it's the raw output, which still needs formatting and cleanup. That's exactly what AiType's AI pass provides.
Also read: Whisper vs Groq: speed deep dive · On-device vs cloud dictation · Best dictation apps 2026
Try Groq-powered Whisper in AiType
14-day free trial. ~250ms latency with AI cleanup included.