Before Deep Learning: How HMM-GMM and Kaldi Did Speech Recognition
For two decades before end-to-end models, speech recognition was not one network but a pipeline of specialized parts, each trained and tuned in turn. From 2011 on, the open-source toolkit Kaldi was what ran that pipeline for most of the research world. It is the baseline the rest of this series measures against — and the machinery a lot of early speech work, ours included, was built on.
Step 1 — Features (MFCC)
You cannot feed a raw waveform to a recognizer; you first turn it into a sequence of feature vectors, one every 10 milliseconds or so. The classic choice is MFCCs (Mel-frequency cepstral coefficients), which compress a slice of audio into a dozen-odd numbers along a frequency scale shaped like human hearing. Filterbank (fBank) and PLP features are close cousins. Then comes a small but critical clean-up step, CMVN (cepstral mean and variance normalization): it shifts and scales the features to zero mean and unit variance, so the same phoneme spoken into a cheap laptop mic and a studio condenser ends up looking similar. Without it, the model wastes capacity learning the microphone instead of the speech.
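To make the CMVN step concrete, here is a minimal sketch of per-utterance normalization in NumPy. The two "microphones" are hypothetical, and modelling a channel as a fixed offset and gain on each coefficient is a simplifying assumption for illustration, not a claim about real hardware. In practice the statistics are often accumulated per speaker rather than per utterance.

```python
import numpy as np

def cmvn(features: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Per-utterance CMVN on a (num_frames, num_coeffs) feature matrix."""
    mean = features.mean(axis=0, keepdims=True)  # one mean per coefficient
    std = features.std(axis=0, keepdims=True)    # one std per coefficient
    return (features - mean) / (std + eps)

# Toy utterance: 200 frames of 13 coefficients (random stand-ins for MFCCs).
rng = np.random.default_rng(0)
clean = rng.normal(size=(200, 13))

# Two hypothetical channels that differ by a constant gain and offset.
laptop_mic = 0.5 * clean + 3.0
studio_mic = 2.0 * clean - 1.0

normalized_laptop = cmvn(laptop_mic)
normalized_studio = cmvn(studio_mic)
```

After normalization the two recordings are numerically almost identical, which is the point: the acoustic model no longer has to spend parameters on the channel.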
Step 2 — The acoustic model (GMM-HMM)
Here is the hard part: mapping fuzzy, variable features to discrete sounds. The HMM (hidden Markov model) treats each phoneme as a tiny state machine: a few hidden states (typically three) wired together with transition probabilities. The states are hidden; what you observe is the MFCCs. So the central question is the emission probability: how likely is this feature vector, given that hidden state? The classical answer is a GMM (Gaussian mixture model), which describes each state's feature distribution as a weighted sum of Gaussians, so a vector scores highly under the states whose training frames it resembles. Training alternates two moves in the spirit of the EM algorithm: alignment (use Viterbi to decide which frames belong to which states) and re-estimation of each state's GMM from the frames assigned to it.
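The two ingredients above, emission scoring and Viterbi alignment, can be sketched in a few lines of NumPy. This is a toy under stated assumptions, not Kaldi's implementation: the function names, the 1-D "features", the three-state left-to-right topology and the 0.5/0.5 transition probabilities are all made up for illustration, and the re-estimation half of training is left out.

```python
import numpy as np

def gmm_log_likelihood(x, weights, means, variances):
    """log p(x | state) under a diagonal-covariance GMM.

    x: (D,) feature vector; weights: (M,); means, variances: (M, D).
    """
    diff = x - means
    log_gauss = -0.5 * (np.log(2 * np.pi * variances) + diff**2 / variances).sum(axis=1)
    # Log of the weighted sum of component densities, computed stably.
    return np.logaddexp.reduce(np.log(weights) + log_gauss)

def viterbi_align(log_emit, log_stay, log_move):
    """Best left-to-right state sequence for a (T, S) emission matrix.

    Each frame either stays in its state or moves to the next one; the path
    must start in state 0 and end in state S-1 (assumes T >= S).
    """
    T, S = log_emit.shape
    score = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    score[0, 0] = log_emit[0, 0]
    for t in range(1, T):
        for s in range(S):
            stay = score[t - 1, s] + log_stay
            move = score[t - 1, s - 1] + log_move if s > 0 else -np.inf
            if move > stay:
                score[t, s], back[t, s] = move, s - 1
            else:
                score[t, s], back[t, s] = stay, s
            score[t, s] += log_emit[t, s]
    # Trace back from the final state to recover one state per frame.
    path = [S - 1]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Three hypothetical states, each a 2-component GMM over 1-D "features".
weights = np.array([0.5, 0.5])
variances = np.ones((2, 1))
state_means = [np.array([[-0.5], [0.5]]),
               np.array([[4.5], [5.5]]),
               np.array([[9.5], [10.5]])]
frames = np.array([[0.1], [-0.2], [4.8], [5.3], [5.1], [9.9], [10.2]])

log_emit = np.array([[gmm_log_likelihood(x, weights, m, variances) for m in state_means]
                     for x in frames])
alignment = viterbi_align(log_emit, np.log(0.5), np.log(0.5))
```

Here `alignment` assigns the first two frames to state 0, the next three to state 1 and the last two to state 2. In a real training loop, those frame-to-state assignments are what the next round of GMM re-estimation would be computed from.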
One refinement does most of the heavy lifting for accuracy: context. A "t" sounds different before "ee" than before "oo", so instead of modeling single phones (monophones) we model triphones, each phone in the context of its neighbours. That explodes the number of states, so a decision tree ties similar ones together to keep the parameter count sane. This is the MFCC + mono + triphone ladder every Kaldi recipe climbs.
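A small sketch shows both the idea and the explosion. The `left-centre+right` label format is the conventional HTK-style way of writing triphones; the helper name, the `sil` boundary symbol and the 40-phone inventory are assumptions made for this example.

```python
def to_triphones(phones, boundary="sil"):
    """Rewrite a phone sequence as context-dependent labels: left-centre+right."""
    padded = [boundary] + list(phones) + [boundary]
    return [f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
            for i in range(1, len(padded) - 1)]

tea = to_triphones(["t", "iy"])  # "t" followed by "ee"
two = to_triphones(["t", "uw"])  # "t" followed by "oo"

# With a hypothetical inventory of 40 phones, the number of possible
# triphones is cubic in the inventory size.
num_phones = 40
possible_triphones = num_phones ** 3
```

The same monophone "t" becomes `sil-t+iy` in one word and `sil-t+uw` in the other, so each gets its own model. With 40 phones there are 64,000 possible triphones, most of them rare or unseen in training data, which is why the decision tree has to tie them.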
Step 3 — Lexicon and language model
The acoustic model deals in phones; people want words. A pronunciation lexicon maps phone sequences to words, and an n-gram language model scores which word sequences are plausible — the part that knows "recognize speech" is far likelier than "wreck a nice beach." The recognizer's job is to combine acoustic scores and language scores and search for the best overall sentence. (How that search is actually done — by compiling everything into one graph — is its own story, in the WFST piece.)
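The score combination can be sketched with a toy bigram model. Every number below is invented for illustration: the bigram probabilities, the acoustic log-likelihoods and the language-model weight are all hypothetical. A real recognizer searches a graph rather than comparing two finished sentences.

```python
import math

# Hypothetical bigram probabilities, stored as log-probabilities.
bigram_logprob = {
    ("<s>", "recognize"): math.log(0.01),
    ("recognize", "speech"): math.log(0.2),
    ("speech", "</s>"): math.log(0.3),
    ("<s>", "wreck"): math.log(0.001),
    ("wreck", "a"): math.log(0.1),
    ("a", "nice"): math.log(0.02),
    ("nice", "beach"): math.log(0.01),
    ("beach", "</s>"): math.log(0.3),
}
UNSEEN = math.log(1e-6)  # crude floor for bigrams missing from the table

def lm_score(words):
    """Bigram log-probability of a word sequence, with sentence boundaries."""
    tokens = ["<s>"] + list(words) + ["</s>"]
    return sum(bigram_logprob.get(pair, UNSEEN) for pair in zip(tokens, tokens[1:]))

def best_hypothesis(acoustic_scores, lm_weight=10.0):
    """Pick the sentence with the best combined acoustic + weighted LM score."""
    def total(sentence):
        return acoustic_scores[sentence] + lm_weight * lm_score(sentence.split())
    return max(acoustic_scores, key=total)

# Made-up acoustic log-likelihoods: the wrong sentence fits the audio slightly better.
acoustic_scores = {"recognize speech": -1205.0, "wreck a nice beach": -1200.0}

winner_with_lm = best_hypothesis(acoustic_scores, lm_weight=10.0)
winner_without_lm = best_hypothesis(acoustic_scores, lm_weight=0.0)
```

On acoustics alone the recognizer would output "wreck a nice beach"; once the language model is weighed in, "recognize speech" wins. Scaling the language-model score relative to the acoustic score is common practice because the two are on very different numeric ranges, and the weight is usually tuned on held-out data.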
Step 4 — Squeezing out the last 20%
Two more stages separated a demo from a product. First, discriminative training: after the GMMs are trained to fit the data (maximum likelihood), criteria like MMI, boosted MMI, MPE and sMBR re-train the model to make the correct transcript win against its competitors. The goal is not just to "fit the right answer" but to "beat the wrong ones." It was routinely worth a 10–20% relative drop in error (a 20% word error rate falling to 16–18%, say). Second, speaker adaptation: techniques such as fMLLR, VTLN and, later, i-vectors nudged the model toward the particular voice and channel in front of it.
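The difference between "fit the right answer" and "beat the wrong ones" shows up in a toy comparison of the two objectives. This is a simplified sketch: real MMI training sums over a lattice of competing hypotheses, whereas here the competitors are a two-item list, and all scores are invented combined (acoustic plus language model) log-scores.

```python
import math

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def ml_objective(scores, reference):
    """Maximum likelihood: only the score of the correct transcript matters."""
    return scores[reference]

def mmi_objective(scores, reference):
    """MMI: log-posterior of the correct transcript among all competitors."""
    return scores[reference] - log_sum_exp(list(scores.values()))

reference = "recognize speech"

# Hypothetical model A: fits the reference well, but the competitor is almost tied.
model_a = {"recognize speech": -100.0, "wreck a nice beach": -100.5}
# Hypothetical model B: fits the reference a little worse, but wins by a wide margin.
model_b = {"recognize speech": -102.0, "wreck a nice beach": -108.0}

ml_prefers_a = ml_objective(model_a, reference) > ml_objective(model_b, reference)
mmi_prefers_b = mmi_objective(model_b, reference) > mmi_objective(model_a, reference)
```

Maximum likelihood prefers model A because it assigns the reference a higher score. MMI prefers model B because the reference is far ahead of its rival there, and that margin is what decides the recognizer's output.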
Why it's worth knowing
This pipeline was genuinely powerful — it ran the world's ASR research for years. But look at how many separate, hand-engineered, sequentially-trained boxes it took: features, a clustering acoustic model, a lexicon, a language model, a search graph, and several adaptation passes on top. The entire arc of this series is the story of those boxes collapsing, one by one, into a single model you train end-to-end. CTC knocked out the first of them.
MFCCs date to Davis & Mermelstein (1980); the GMM-HMM paradigm dominated ASR through roughly 2012, and the Kaldi toolkit (Povey et al., 2011) became its standard implementation. Part of our ASR evolution series.
