CTC: How Speech Recognition Learned to Read Without Alignment

ASR evolution series · the technique dates to 2006, its end-to-end heyday was 2014–2020 · we first wrote this up in 2022, expanded here

Before about 2014, teaching a machine to transcribe speech meant solving a chicken-and-egg problem: to train the model you needed to know which slice of audio matched which sound, but recovering that alignment was itself the hard part. Connectionist Temporal Classification (CTC), introduced by Alex Graves and colleagues in 2006, removed the need for that alignment at training time. It's a small idea with outsized consequences (it's what let speech recognition go "end-to-end"), and it's the cleanest place to start tracing how ASR evolved into the models that now run, offline, on a laptop.

The alignment problem

Audio arrives as a long stream, roughly 100 feature frames per second. The transcript is short: "cat" is three characters. Nothing tells you that frames 1–8 are c, 9–20 are silence, 21–35 are a, and so on. The classical systems (the HMM-GMM models that toolkits such as Kaldi made widely accessible) bootstrapped these alignments iteratively with the EM algorithm and "forced alignment": a multi-stage pipeline that worked, but was fiddly and coupled the acoustic model to a rigid frame-by-frame labeling. You could not simply hand a model raw audio and its transcript and say "learn."

The trick: a blank token and a collapse rule

CTC lets the network emit one symbol per frame, either a real token or a special blank, written ∅, then collapses that raw per-frame output into final text in two steps: merge consecutive duplicates, then delete the blanks. So a frame sequence like ∅ccc∅aa∅t∅ collapses to cat.
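The two-step collapse rule fits in a few lines of Python. This is a minimal sketch for illustration, not any library's implementation: the function name ctc_collapse and the use of the character ∅ as the blank are our own choices.

```python
BLANK = "∅"  # assumed stand-in character for the CTC blank symbol


def ctc_collapse(frames: str, blank: str = BLANK) -> str:
    """Collapse a per-frame symbol string: merge repeats, then drop blanks."""
    merged = []
    prev = None
    for sym in frames:
        if sym != prev:  # step 1: keep only the first symbol of each run
            merged.append(sym)
        prev = sym
    # step 2: delete the blanks that survived the merge
    return "".join(s for s in merged if s != blank)


print(ctc_collapse("∅ccc∅aa∅t∅"))  # cat
```

The order of the two steps matters: deleting blanks first would glue separated repeats back together before the merge could see them as distinct.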

The blank does two jobs. It lets the model say "nothing new here" on the frames between sounds, and — crucially — it separates genuine repeats. The word hello needs both of its l's to survive, so the model learns to place a blank between them (…he∅l∅l∅o…) rather than letting them merge into one. That single extra symbol is what makes the whole scheme work.
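To see the second job concretely, the sketch below collapses "hello" with and without a separating blank, and builds the shortest frame sequence that still collapses to a given word. The helper names (collapse, shortest_alignment) are hypothetical, and the code assumes the target text never contains the blank character itself.

```python
from itertools import groupby

BLANK = "∅"  # assumed stand-in character for the CTC blank symbol


def collapse(frames, blank=BLANK):
    # groupby yields one key per run of identical symbols; blanks are then dropped
    return "".join(sym for sym, _ in groupby(frames) if sym != blank)


def shortest_alignment(target, blank=BLANK):
    """Shortest frame string that collapses to target.

    A blank is needed only between two adjacent identical characters.
    """
    out = []
    for ch in target:
        if out and out[-1] == ch:
            out.append(blank)  # keep the genuine repeat from merging
        out.append(ch)
    return "".join(out)


print(collapse("hello"))                      # helo
print(shortest_alignment("hello"))            # hel∅lo
print(collapse(shortest_alignment("hello")))  # hello
```

One consequence worth noting: a transcript with repeated adjacent characters needs at least one more frame per repeat than its character count, so the audio must supply at least that many frames.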

Training without ever labeling a frame

Here is the part that earns CTC its place in the story. Many different frame sequences collapse to the same transcript: ∅∅cat, c∅a∅t∅, and cc∅aa∅tt all become "cat". CTC does not pick one "correct" alignment. It sums the probability of every valid alignment that produces the target text, and training maximizes that total (the loss is its negative log). The sum looks intractable, since there are exponentially many paths, but it factorizes, and the forward-backward algorithm (a dynamic-programming technique borrowed from HMMs) computes it efficiently.
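The claim that the exponential sum factorizes can be checked on a toy example. The sketch below computes the total probability of a target two ways: by enumerating every frame path, and with the forward half of the forward-backward recursion, which is enough to get the loss value (the backward half is used for gradients). Everything here is illustrative: the per-frame probabilities are made up, symbols are small integers with 0 assumed to be the blank, and real implementations work in log space for numerical stability.

```python
import itertools
import math

BLANK = 0  # assumed index of the blank in each frame's distribution


def collapse(path, blank=BLANK):
    """Merge consecutive duplicates, then drop blanks (on integer symbols)."""
    out, prev = [], None
    for s in path:
        if s != prev and s != blank:
            out.append(s)
        prev = s
    return out


def brute_force_prob(probs, target):
    """Sum the probability of every frame path that collapses to target."""
    num_frames, vocab = len(probs), len(probs[0])
    total = 0.0
    for path in itertools.product(range(vocab), repeat=num_frames):
        if collapse(path) == list(target):
            total += math.prod(probs[t][s] for t, s in enumerate(path))
    return total


def forward_prob(probs, target):
    """Same sum via the CTC forward recursion, O(frames * len(target))."""
    ext = [BLANK]
    for s in target:
        ext += [s, BLANK]  # blank-interleaved target, length 2L + 1
    n = len(ext)
    alpha = [0.0] * n
    alpha[0] = probs[0][ext[0]]       # start on the leading blank...
    if n > 1:
        alpha[1] = probs[0][ext[1]]   # ...or on the first real symbol
    for t in range(1, len(probs)):
        new = [0.0] * n
        for i in range(n):
            a = alpha[i]                  # stay in the same state
            if i >= 1:
                a += alpha[i - 1]         # advance by one state
            if i >= 2 and ext[i] != BLANK and ext[i] != ext[i - 2]:
                a += alpha[i - 2]         # skip the blank between different symbols
            new[i] = a * probs[t][ext[i]]
        alpha = new
    # valid paths end on the last symbol or on the trailing blank
    return alpha[-1] + (alpha[-2] if n > 1 else 0.0)


# Made-up distributions: 4 frames over {blank, symbol 1, symbol 2}
probs = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.2, 0.7],
    [0.5, 0.25, 0.25],
]
target = [1, 2]
print(brute_force_prob(probs, target))
print(forward_prob(probs, target))
print(-math.log(forward_prob(probs, target)))  # the CTC loss for this pair
```

The two probabilities agree up to floating-point rounding. The brute-force version visits 3^4 = 81 paths here and grows exponentially with the number of frames, while the recursion does a fixed amount of work per frame and per target position.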

The consequence is profound: you train on (audio, text) pairs alone. No frame labels, no forced alignment, no separate pronunciation step. The model learns the alignments as a byproduct of learning to read.

What CTC gave up

The simplicity has a price, and that price is why the field kept moving. CTC assumes each frame's output is conditionally independent of the others given the audio. In plain terms, it has no built-in sense of language: it does not natively know that "recognize speech" is far more likely than "wreck a nice beach." Early CTC systems leaned on a separate external language model at decode time to patch this. CTC outputs also tend to be "peaky" (long runs of blanks punctuated by confident spikes), and the independence assumption limits accuracy where the right word depends on linguistic context rather than acoustics alone. Good enough to change the game; not good enough to end it.
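In symbols, the assumption and the decode-time patch look roughly like this. The notation is ours, not taken from the original paper's text: x is the audio, π a per-frame path of length T, B the collapse map from paths to transcripts, and y a transcript. The second line is one common form of language-model fusion; the weights α and β are tunable values assumed here for illustration, and the length bonus β|y| is widespread but not universal.

```latex
% Conditional independence across frames, and the sum over alignments
p(\pi \mid x) = \prod_{t=1}^{T} p(\pi_t \mid x),
\qquad
p(y \mid x) = \sum_{\pi \in \mathcal{B}^{-1}(y)} p(\pi \mid x)

% Decode-time fusion with an external language model (illustrative form)
\hat{y} = \arg\max_{y} \; \log p_{\mathrm{CTC}}(y \mid x)
          + \alpha \, \log p_{\mathrm{LM}}(y) + \beta \, |y|
```

Nothing in the first line lets the symbol chosen at frame t depend on the symbols chosen at other frames, which is why any preference for "recognize speech" over an acoustically similar phrase has to come from the language-model term.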

Where it sits in the evolution

Read the arc in three moves:

Before (the Kaldi era). HMM-GMM acoustic models, forced alignment, and a separate lexicon and language model — powerful, but a pipeline of specialists that had to be tuned in stages.

CTC (idea 2006; practice ~2014–2020). It collapsed the acoustic side into a single network you could train end-to-end, and powered the first wave of deep end-to-end recognizers — Graves' own work, Baidu's Deep Speech, EESEN, wav2letter — while making streaming recognition natural.

After. The RNN-Transducer (RNN-T) added a prediction network onto CTC's backbone to restore the context CTC threw away, and became the workhorse of on-device, streaming dictation. In parallel, attention models (Listen, Attend and Spell) and then Transformers folded alignment into attention itself and scaled into the systems we use today — Whisper among them. CTC never disappeared: it still lives inside hybrid CTC/attention models and as a fast forced-aligner. It simply stopped being the whole answer, which is what a healthy field looks like.

Why we trace this

This lineage matters to us because it is the same arc that made our own work possible. The move from cloud-scale, multi-stage HMM pipelines toward compact, end-to-end models is exactly what let high-quality transcription run on an ordinary machine with nothing leaving it. CTC was an early, decisive step down that road — the moment a recognizer stopped needing to be told where every sound was, and started figuring it out on its own.

CTC was introduced in Graves, Fernández, Gomez & Schmidhuber, "Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks" (ICML 2006). This piece is part of our ASR evolution series — the techniques behind modern speech recognition, each dated to the era it actually mattered.