← All articles

HCLG: How Classical ASR Turned Sound Into Words With a Cascade of Graphs

ASR evolution series · Era 1, WFST decoding (~2002 onward) · originally 2022, expanded here

Suppose you have an acoustic model, a lexicon, and a language model. How do you search the astronomically large space of possible transcripts fast enough to be useful? The classical answer is one of the most elegant ideas in all of speech recognition: compile every source of knowledge into a single weighted graph, and walk it. The graph is a weighted finite-state transducer, and in Kaldi it goes by the name HCLG.

What a WFST is

Build the idea up in three steps. A finite-state acceptor (FSA) is a graph that reads a string and decides whether to accept it. A weighted acceptor (WFSA) also attaches a probability to that acceptance. A weighted finite-state transducer (WFST) goes one further: every arc carries an input label, an output label, and a weight, so the graph translates one sequence into another while scoring it. Three operations make transducers composable into bigger ones: compose (chain two transducers), determinize, and minimize. Kaldi uses OpenFST for these (with its own tuned implementations for speed).

The HCLG cascade

The trick is that each piece of ASR knowledge is itself a transducer, and composing them yields one graph that does the whole job. From the language side inward:

G — the grammar, i.e. the n-gram language model, as a weighted acceptor over words. L — the lexicon, mapping phone sequences to words. C — context dependency, expanding phones into triphones. H — the HMM, mapping state sequences to those triphones. Compose them — H ∘ C ∘ L ∘ G — and you get HCLG: a single static graph whose input labels are HMM transition-ids and whose output labels are words. The probabilities thread through correctly — word-to-word from the LM, word-to-phone from the lexicon, phone-to-state from the HMM's transitions.

Decoding is a walk through the graph

With HCLG built once, recognition is a graph search. The acoustic model scores how well each frame of MFCCs matches each HMM state; the Viterbi algorithm threads those scores through HCLG to find the best-scoring path; and because the graph is enormous, pruning (beam search and histogram pruning) keeps only the promising paths alive. Read the output labels along the winning path, map the ids back through a symbol table, and you have your transcript.

Lattices — keeping the alternatives

The single best path is often not the best answer, so a good decoder keeps a lattice: a compact graph of the top competing hypotheses, each arc tagged with its acoustic cost and graph (LM) cost. This enables a powerful two-pass trick — decode quickly with a small language model to produce the lattice, then rescore that lattice with a much larger or neural language model to pick the final winner, without ever re-running the expensive acoustic search. N-best lists are just the lattice flattened into its top few word sequences.

Why it still matters

WFST decoding was a high point of pre-neural engineering: it made the search modular, exact, and fast, and the core idea — compose your knowledge sources into one searchable graph — long outlived the GMM era. Lattices in particular are still how systems carry uncertainty forward for rescoring. The end-to-end models that followed absorbed the lexicon and language model into the network, which is what let most of this beautiful machinery be retired — but understanding it is the fastest way to appreciate what those models quietly replaced.

The WFST approach to speech decoding was developed by Mohri, Pereira & Riley (around 2002); Kaldi builds its HCLG graphs on the OpenFST library. Part of our ASR evolution series.