Before Deep Learning: How HMM-GMM and Kaldi Did Speech Recognition

ASR evolution series · Era 1, the classical pipeline (~1990s–2014) · originally 2021, expanded here

For two decades before end-to-end models, speech recognition was not one network but a pipeline of specialized parts, each trained and tuned in turn. From 2011 on, the open-source toolkit Kaldi was what ran that pipeline for most of the research world. It is the baseline the rest of this series measures against — and the machinery a lot of early speech work, ours included, was built on.

Step 1 — Features (MFCC)

A classical recognizer does not consume the raw waveform; the audio is first turned into a sequence of feature vectors, typically one every 10 milliseconds, each computed over an overlapping window of about 25 milliseconds. The classic choice is MFCCs (Mel-frequency cepstral coefficients), which compress each slice of audio into a dozen-odd numbers (13 is the usual count) along a frequency scale shaped like human hearing. Filterbank (fBank) and PLP features are close cousins. Then comes a small but critical clean-up step, CMVN (cepstral mean and variance normalization): it shifts and scales each feature dimension to zero mean and unit variance, usually per utterance or per speaker, so the same phoneme spoken into a cheap laptop mic and a studio condenser ends up looking similar. Without it, the model wastes capacity learning the microphone instead of the speech.
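
To make the CMVN step concrete, here is a minimal NumPy sketch of per-utterance normalization. The function name `cmvn`, the random stand-in "MFCCs" and the toy channel model (a fixed gain and offset per coefficient) are assumptions for illustration, not Kaldi's implementation; a real channel is messier than this.

```python
import numpy as np

def cmvn(features, eps=1e-8):
    """Per-utterance cepstral mean and variance normalization.

    features: array of shape (num_frames, num_coeffs), e.g. 13 MFCCs per frame.
    Returns an array of the same shape with zero mean and (almost exactly)
    unit variance in every coefficient.
    """
    mean = features.mean(axis=0, keepdims=True)
    std = features.std(axis=0, keepdims=True)
    return (features - mean) / (std + eps)

# Toy data: random numbers standing in for 200 frames of 13 MFCCs.
rng = np.random.default_rng(0)
studio = rng.normal(size=(200, 13))

# Assumed toy channel: the laptop mic applies a fixed gain and a fixed
# offset to every coefficient. This is an idealization for the demo.
offset = 5.0 * rng.normal(size=(1, 13))
laptop = 0.5 * studio + offset

print(np.allclose(studio, laptop))              # raw features differ
print(np.allclose(cmvn(studio), cmvn(laptop), atol=1e-5))  # normalized ones match
```

Under this idealized channel the two recordings become indistinguishable after normalization, which is exactly the effect the acoustic model benefits from.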

Step 2 — The acoustic model (GMM-HMM)

Here is the hard part: mapping fuzzy, variable features to discrete sounds. The HMM (hidden Markov model) treats each phoneme as a tiny state machine: a few states (typically three) wired left to right with transition probabilities. The states are hidden; what you observe is the MFCCs. So the central question is the emission probability: how likely is this feature vector, given that hidden state? The classical answer is a GMM (Gaussian mixture model): each state gets its own weighted sum of Gaussians, a flexible density over the messy space of MFCCs that assigns high likelihood to the vectors that state tends to produce. Training alternates two moves: alignment (use Viterbi to decide which frames belong to which states) and re-estimation of each state's GMM on the frames assigned to it, via the EM algorithm.
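
The two ingredients named above, a GMM emission score and a Viterbi alignment, can be sketched in a few lines of NumPy. This is a toy under stated assumptions: one-dimensional "features", one Gaussian per state, equal transition probabilities, and hypothetical function names (`gmm_log_likelihood`, `viterbi_align`). The EM re-estimation step is not shown.

```python
import numpy as np

def gmm_log_likelihood(x, weights, means, variances):
    """log p(x | state) for a diagonal-covariance GMM.

    x: (D,)   weights: (M,)   means, variances: (M, D)
    """
    diff = x - means
    log_gauss = -0.5 * (np.sum(np.log(2.0 * np.pi * variances), axis=1)
                        + np.sum(diff ** 2 / variances, axis=1))
    a = np.log(weights) + log_gauss
    m = a.max()
    return m + np.log(np.sum(np.exp(a - m)))  # stable log-sum-exp

def viterbi_align(log_emit, log_self, log_next):
    """Best left-to-right state sequence for T frames and S states.

    log_emit: (T, S) emission log-likelihoods. Each state may loop on
    itself or advance to the next one; the path starts in state 0 and
    ends in state S - 1.
    """
    T, S = log_emit.shape
    score = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    score[0, 0] = log_emit[0, 0]
    for t in range(1, T):
        for s in range(S):
            stay = score[t - 1, s] + log_self
            move = score[t - 1, s - 1] + log_next if s > 0 else -np.inf
            if move > stay:
                score[t, s], back[t, s] = move + log_emit[t, s], s - 1
            else:
                score[t, s], back[t, s] = stay + log_emit[t, s], s
    path = [S - 1]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Three toy states, each a single 1-D Gaussian centred at 0, 5 and 10.
states = [(np.array([1.0]), np.array([[mu]]), np.array([[1.0]]))
          for mu in (0.0, 5.0, 10.0)]
frames = np.array([[0.1], [-0.2], [4.9], [5.3], [5.1], [9.8], [10.2]])

log_emit = np.array([[gmm_log_likelihood(x, *states[s]) for s in range(3)]
                     for x in frames])
alignment = viterbi_align(log_emit, np.log(0.5), np.log(0.5))
print(alignment)  # [0, 0, 1, 1, 1, 2, 2]
```

In real training the frames assigned to each state by this alignment would then be used to re-estimate that state's mixture weights, means and variances, and the loop repeats.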

One refinement does most of the heavy lifting for accuracy: context. A "t" sounds different before "ee" than before "oo", so instead of modeling single phones (monophones) we model triphones, each phone in the context of its left and right neighbours. That explodes the number of states, so a decision tree ties acoustically similar ones together to keep the parameter count sane. This is the ladder every Kaldi recipe climbs: extract MFCCs, train a monophone model, then use its alignments to train triphone models.
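
A short sketch of what "each phone in context" means in practice. The `left-center+right` label format is a common notation; the helper name `to_triphones`, the ARPAbet-style phone symbols, the `sil` boundary symbol and the 40-phone inventory are assumptions for illustration.

```python
def to_triphones(phones, boundary="sil"):
    """Expand a phone sequence into left-center+right context labels."""
    padded = [boundary] + list(phones) + [boundary]
    return [f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
            for i in range(1, len(padded) - 1)]

tea = to_triphones(["t", "iy"])  # "t" before "ee"
two = to_triphones(["t", "uw"])  # "t" before "oo"
print(tea)  # ['sil-t+iy', 't-iy+sil']
print(two)  # ['sil-t+uw', 't-uw+sil']

# Why tying is needed: with an assumed inventory of 40 phones there are
# 40**3 possible triphones, each with a few HMM states of its own.
num_phones = 40
states_per_phone = 3
print(num_phones ** 3)                     # 64000 possible triphones
print(num_phones ** 3 * states_per_phone)  # 192000 untied states
```

The two "t" labels differ, so they can get different models; the counts at the end show why most of those models have to share parameters.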

Step 3 — Lexicon and language model

The acoustic model deals in phones; people want words. A pronunciation lexicon maps phone sequences to words, and an n-gram language model scores which word sequences are plausible — the part that knows "recognize speech" is far likelier than "wreck a nice beach." The recognizer's job is to combine acoustic scores and language scores and search for the best overall sentence. (How that search is actually done — by compiling everything into one graph — is its own story, in the WFST piece.)
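
As a rough illustration of how acoustic and language scores combine, the sketch below rescores two hypotheses with a toy bigram model. Every number, the tiny lexicon, the floor used for unseen bigrams and the language-model weight are made up for the example; a real system would use a trained, smoothed n-gram model and search a graph rather than a two-item list.

```python
import math

# Illustrative pronunciations only (ARPAbet-style symbols assumed).
LEXICON = {
    "recognize": ["r", "eh", "k", "ax", "g", "n", "ay", "z"],
    "speech": ["s", "p", "iy", "ch"],
    "wreck": ["r", "eh", "k"],
    "a": ["ax"],
    "nice": ["n", "ay", "s"],
    "beach": ["b", "iy", "ch"],
}

# Hypothetical bigram probabilities, stored as natural logs.
BIGRAM_LOGPROB = {
    ("<s>", "recognize"): math.log(0.01),
    ("recognize", "speech"): math.log(0.2),
    ("speech", "</s>"): math.log(0.3),
    ("<s>", "wreck"): math.log(0.001),
    ("wreck", "a"): math.log(0.1),
    ("a", "nice"): math.log(0.02),
    ("nice", "beach"): math.log(0.01),
    ("beach", "</s>"): math.log(0.3),
}
UNSEEN_LOGPROB = math.log(1e-6)  # crude floor standing in for real smoothing

def phones_for(words):
    """Look up each word in the lexicon and concatenate the phones."""
    return [p for w in words for p in LEXICON[w]]

def lm_logprob(words):
    """Bigram log-probability of a word sequence with sentence markers."""
    seq = ["<s>"] + list(words) + ["</s>"]
    return sum(BIGRAM_LOGPROB.get(pair, UNSEEN_LOGPROB)
               for pair in zip(seq, seq[1:]))

def total_score(acoustic_logprob, words, lm_weight=10.0):
    """Acoustic log score plus a scaled language-model log score."""
    return acoustic_logprob + lm_weight * lm_logprob(words)

# Made-up acoustic scores: the wrong sentence fits the audio slightly better.
hypotheses = {
    "recognize speech": -1200.0,
    "wreck a nice beach": -1195.0,
}
best = max(hypotheses, key=lambda h: total_score(hypotheses[h], h.split()))
print(phones_for("wreck a nice beach".split()))
print(best)  # recognize speech
```

The acoustic score alone would pick the wrong sentence here; the language model tips the balance. The weight on the language-model term is a tuning knob in practice, and 10.0 is just a placeholder.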

Step 4 — Squeezing out the last 20%

Two more stages separated a demo from a product. First, discriminative training: after the GMMs are trained to fit the data (maximum likelihood), criteria like MMI, boosted MMI, MPE and sMBR re-train the model to make the correct transcript win against its competitors. The goal is not just "fit the right answer" but "beat the wrong ones." Discriminative training was routinely worth a 10–20% relative drop in error. Second, speaker adaptation: techniques such as fMLLR, VTLN and i-vectors adapted the features or the model to the particular voice and channel in front of it.
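
The "beat the wrong ones" idea can be shown with a stripped-down version of the MMI criterion: the log posterior of the reference transcript among a set of competing hypotheses. The scores below are invented, the competitor list stands in for the lattice a real system would use, and the function names are hypothetical. The last lines also spell out what a "relative" drop in error means.

```python
import math

def log_sum_exp(values):
    """Numerically stable log of a sum of exponentials."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def mmi_objective(ref_score, competitor_scores):
    """Log posterior of the reference among all hypotheses.

    Each score is a combined acoustic + language-model log score.
    Maximum likelihood only tries to raise ref_score; this quantity
    also rises when the competitors' scores fall.
    """
    return ref_score - log_sum_exp([ref_score] + list(competitor_scores))

# Same reference score in both cases; only the competitors change.
before = mmi_objective(-100.0, [-100.5, -101.0])
after = mmi_objective(-100.0, [-103.0, -104.0])
print(before < after)  # True: pushing competitors down raises the objective

def relative_reduction(old_wer, new_wer):
    """Relative drop in word error rate, as a fraction of the old rate."""
    return (old_wer - new_wer) / old_wer

# Hypothetical numbers: 20% WER down to 17% is 3 points absolute,
# which is a 15% relative reduction.
print(round(relative_reduction(20.0, 17.0), 2))  # 0.15
```

Note that the reference score is identical in both cases, so a pure maximum-likelihood view would see no difference between them.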

Why it's worth knowing

This pipeline was genuinely powerful: it ran the world's ASR research for years. But look at how many separate, hand-engineered, sequentially trained boxes it took: features, a GMM-HMM acoustic model, a lexicon, a language model, a search graph, and several adaptation passes on top. The entire arc of this series is the story of those boxes collapsing, one by one, into a single model you train end-to-end. CTC knocked out the first of them.

MFCCs date to Davis & Mermelstein (1980); the GMM-HMM paradigm dominated ASR through roughly 2012, and the Kaldi toolkit (Povey et al., 2011) became its standard implementation. Part of our ASR evolution series.
