How Speech Recognition Evolved: From HMMs to Whisper

ASR evolution series · the field guide

Speech-to-text didn't arrive fully formed. It got here by shedding complexity one step at a time, from sprawling statistical pipelines that took a team of specialists to tune, to a single model that runs offline on a laptop. We've worked with most of these techniques over the years; this series walks the path one idea at a time. This page is the map.

Era 1 — The statistical pipeline (1990s–2014)

For two decades the state of the art was an assembly line. An acoustic model (HMM-GMM, later hybrid DNN-HMM) scored how well each frame of audio matched each phoneme state; a pronunciation lexicon mapped phonemes to words; an n-gram language model scored which word sequences were plausible; and a WFST decoder searched the combination. Features were hand-engineered (MFCCs), speaker and channel variation was handled with tricks like iVectors and CMVN, and the whole thing was trained in careful stages with forced alignment. It worked, and Kaldi made it the research backbone of the field, but every stage was its own specialty. See: the HMM-GMM pipeline, WFST decoding, and iVectors.

Era 2 — The end-to-end break (2014–2017)

The first great simplification: collapse the acoustic pipeline into a single neural network you could train on paired audio and text alone. Connectionist Temporal Classification (CTC) removed the need for forced alignment by summing the probability of every valid alignment between audio frames and transcript at once. Sequence-to-sequence models with attention (Listen, Attend and Spell) went further, letting one network "read" audio and emit text directly. Baidu's Deep Speech showed that a CTC-trained network could scale with data and compute. See: Listen, Attend and Spell.
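
To make the CTC idea concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not code from any toolkit: the function names, the choice of index 0 for the blank symbol, and the toy probabilities are all ours. It computes the probability of a transcript twice, once by enumerating every frame-level path and once with the forward recursion that makes the same sum tractable.

```python
import math
from itertools import product

BLANK = 0  # assumption: index 0 is the CTC blank symbol


def ctc_collapse(path):
    """Merge repeated symbols, then drop blanks."""
    out, prev = [], None
    for s in path:
        if s != prev and s != BLANK:
            out.append(s)
        prev = s
    return out


def ctc_prob_brute_force(probs, target):
    """Sum the probability of every path that collapses to target.

    probs[t][v] is the per-frame probability of symbol v at frame t.
    Exponential in the number of frames, so only usable on toy inputs.
    """
    num_frames, vocab = len(probs), len(probs[0])
    total = 0.0
    for path in product(range(vocab), repeat=num_frames):
        if ctc_collapse(path) == list(target):
            p = 1.0
            for t, s in enumerate(path):
                p *= probs[t][s]
            total += p
    return total


def ctc_prob_forward(probs, target):
    """The same sum via the CTC forward recursion, O(frames * len(target))."""
    # Interleave blanks: [a, b] becomes [blank, a, blank, b, blank].
    ext = [BLANK]
    for s in target:
        ext += [s, BLANK]
    size = len(ext)

    alpha = [0.0] * size
    alpha[0] = probs[0][BLANK]
    if size > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, len(probs)):
        new = [0.0] * size
        for s in range(size):
            a = alpha[s]                      # stay on the same symbol
            if s >= 1:
                a += alpha[s - 1]             # advance by one position
            if s >= 2 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                a += alpha[s - 2]             # skip the blank between two different labels
            new[s] = a * probs[t][ext[s]]
        alpha = new

    # A valid path ends on the last label or on the trailing blank.
    return alpha[-1] + (alpha[-2] if size > 1 else 0.0)


# Toy input: 4 frames, vocabulary {blank, 1, 2}; the numbers are made up.
probs = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.2, 0.7],
    [0.5, 0.25, 0.25],
]
print(ctc_prob_brute_force(probs, [1, 2]), ctc_prob_forward(probs, [1, 2]))
```

The two numbers should agree up to floating-point error. Training toolkits compute this quantity in log space for numerical stability; the sketch stays in plain probabilities to keep the recursion readable.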

Era 3 — Streaming, attention, and the Transformer (2017–2021)

Two needs pulled the field forward at once: context and latency. The RNN-Transducer (RNN-T) added a prediction network, in effect a small internal language model, to a CTC-style encoder, so each output was conditioned on the labels already emitted rather than predicted independently per frame. Crucially, it also streamed, which is what put high-quality recognition on phones. In parallel, the Transformer ("Attention Is All You Need") replaced recurrence with self-attention, the Conformer added convolution back in to capture local acoustic detail, and self-supervised pretraining (wav2vec 2.0) let models learn from mountains of unlabeled audio. See: the Transformer, the Conformer, and self-supervised learning.
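
As a rough illustration of what "self-attention instead of recurrence" means, the sketch below lets every frame look at every other frame in one step. It is deliberately simplified and the names are ours: a real Transformer layer applies learned query, key, and value projections and uses several heads, whereas this version uses the input vectors directly for all three roles.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(frames):
    """Single-head scaled dot-product self-attention.

    Assumption for illustration: queries, keys, and values are the input
    frames themselves (identity projections), not learned projections.
    frames is a list of T vectors of equal dimension d; returns T vectors.
    """
    d = len(frames[0])
    out = []
    for q in frames:
        # Similarity of this frame to every frame, scaled by sqrt(d).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in frames]
        weights = softmax(scores)
        # Output is a weighted average of all frames.
        out.append([sum(w * v[i] for w, v in zip(weights, frames)) for i in range(d)])
    return out


# Three made-up 2-dimensional "audio frames".
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for row in self_attention(frames):
    print(row)
```

Because every output row is computed from all input frames at once, nothing has to be carried step by step through time, which is the property that let Transformers train in parallel across a whole utterance.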

Era 4 — Foundation models (2022→)

Whisper folded the lessons together: one Transformer encoder-decoder, trained on 680,000 hours of weakly labeled web audio, multilingual, and robust enough to ship without per-domain tuning. The same era brought Parakeet and a wave of strong open models. Quantization shrank them, and GPU offload sped them up, enough to run locally. The assembly line of Era 1 had become a single download. See: Whisper and the foundation-model era.
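
For a sense of why quantization shrinks a model, here is a minimal sketch of symmetric 8-bit quantization in plain Python. The function names and sample weights are hypothetical, and this is not the scheme any particular runtime uses (production formats typically quantize in small blocks, each with its own scale). The idea is the same, though: store each weight as a small integer plus a shared scale instead of a 32-bit float.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: floats -> ints in [-127, 127] plus one scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the integers and the scale."""
    return [v * scale for v in q]


# Made-up weights for illustration.
weights = [0.5, -1.27, 0.0, 0.31]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
worst = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, worst)
```

Each weight now needs one byte instead of four, at the cost of a rounding error of at most half the scale per weight.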

The throughline — why we keep this map

Read top to bottom and one trend dominates: every era compressed the pipeline and pushed capability toward the edge. Fewer stages, less hand-engineering, more learned directly from data — until the whole system fit on the device that recorded the audio. That end state is the one we build for: high-quality transcription that runs entirely on your machine, with nothing leaving it. The history isn't nostalgia — it's the reason offline transcription is finally good enough to trust.

This is the index for our ASR evolution series. Each linked piece goes deep on one technique, dated to the era it actually mattered. New entries are added as we publish them.