
Listen, Attend and Spell: How Attention Taught ASR to Read

ASR evolution series · the attention branch, ~2015 · originally 2022, expanded here

CTC and RNN-T emit text as the audio flows past. A different idea, borrowed from machine translation, does the opposite: read the whole utterance first, then write the transcript while "attending" to the slice of audio that matters for each word. That idea is sequence-to-sequence with attention, and its landmark form in speech is Listen, Attend and Spell (2015).

The sequence-to-sequence idea

Sequence-to-sequence (also called encoder-decoder) came out of machine translation: one network (the encoder) compresses an input sequence into an internal representation, and a second network (the decoder) generates an output sequence from it. The mould fits an enormous range of problems — translation, summarization, even question answering — and ASR is just one more instance of it: speech in, text out. Listen, Attend and Spell is exactly this, with each of its three words naming a stage.

Listen — the encoder

The "Listen" stage is a stacked, pyramidal recurrent encoder. It reads acoustic features (MFCCs or filterbanks) and, layer by layer, downsamples them in time. Speech is a very long sequence, often hundreds of frames for a few seconds of audio, so each pyramidal layer merges adjacent frames (pairs, in the original paper) and hands a shorter sequence to the layer above, which keeps the attention stage tractable. The output is a shorter sequence of high-level vectors, each summarizing a chunk of the audio.
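To make the time-folding concrete, here is a minimal NumPy sketch of the downsampling step alone. It assumes a reduction factor of 2 per layer and three layers; the function name `pyramid_reduce` and the 800-frame, 40-dimensional input are made up for illustration. A real encoder would run a recurrent layer between reductions, which this sketch leaves out.

```python
import numpy as np

def pyramid_reduce(frames: np.ndarray, factor: int = 2) -> np.ndarray:
    """Concatenate every `factor` consecutive frames into one wider frame.

    frames: array of shape (T, D). Returns shape (T // factor, D * factor).
    Trailing frames that do not fill a complete group are dropped.
    """
    T, D = frames.shape
    usable = (T // factor) * factor
    return frames[:usable].reshape(usable // factor, D * factor)

# Hypothetical input: 800 frames of 40-dimensional features.
features = np.random.default_rng(0).normal(size=(800, 40))

x = features
for _ in range(3):
    # A real pyramidal layer would apply a recurrent network here and map the
    # widened frames back to a fixed hidden size; this shows only the folding.
    x = pyramid_reduce(x)

print(x.shape)  # (100, 320): 8x fewer time steps, each 8x wider
```

Without the recurrent layers the width grows with every fold; in a trained encoder each layer's hidden size keeps it fixed, so only the time axis shrinks.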

Attend — attention as a soft search

Here is the pivotal part. At every output step, the decoder computes a set of weights over all of the encoder's vectors, a soft search for "which part of the audio matters right now", and forms a context vector as their weighted sum. The weights start as similarity scores (a dot product between the decoder's state and each encoder vector, or an additive variant), which a softmax then normalizes so that they are positive and sum to one. The normalized values are the attention weights; the context vector is what the decoder actually reads. Nothing forces a rigid left-to-right alignment: the model learns where to look.
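One attention step fits in a few lines. The sketch below uses the plain dot-product score; the function name `attend` and the toy vectors are assumptions for illustration, and a trained model would normally pass both sides through learned projections first.

```python
import numpy as np

def attend(decoder_state: np.ndarray, encoder_outputs: np.ndarray):
    """Dot-product attention for a single decoder step.

    decoder_state:   shape (D,)   - the current query
    encoder_outputs: shape (U, D) - one vector per encoder time step
    Returns (weights of shape (U,), context of shape (D,)).
    """
    scores = encoder_outputs @ decoder_state   # similarity per encoder step
    scores = scores - scores.max()             # stabilize the exponentials
    weights = np.exp(scores)
    weights = weights / weights.sum()          # softmax: positive, sums to 1
    context = weights @ encoder_outputs        # weighted sum of encoder vectors
    return weights, context

# Toy example: four encoder vectors, and a query that resembles the third.
encoder_outputs = np.eye(4) * 5.0
decoder_state = np.array([0.0, 0.0, 5.0, 0.0])

weights, context = attend(decoder_state, encoder_outputs)
print(weights.argmax())  # 2: almost all the weight lands on the third vector
```

Because the weights are a softmax rather than a hard choice, the whole step is differentiable, which is what lets the alignment be learned from transcripts alone.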

Spell — the decoder

The "Spell" stage is an autoregressive decoder. It takes the context vector plus the previously emitted token, produces a probability distribution over the next token (a character or word-piece), emits one, and repeats, each new token reshaping the next attention query, until it produces an end-of-sentence marker. The simplest choice is to emit the most likely token at each step (greedy decoding); beam search, which keeps several candidate transcripts alive, is the usual refinement. Training minimizes cross-entropy against the true transcript, with one important trick: teacher forcing. During training the decoder is fed the ground-truth previous token rather than its own guess, because early in training its own guesses are random and would poison everything downstream.
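The difference between training and inference is easiest to see side by side. The toy decoder below is a rough sketch, not the LAS architecture: the weights are random and untrained, the recurrent update is a single tanh layer, and every name and size in it (`step`, `teacher_forced_loss`, `greedy_decode`, `V`, `H`, `SOS`, `EOS`) is hypothetical. The only thing to notice is which token gets fed back at each step.

```python
import numpy as np

rng = np.random.default_rng(0)
V, H = 6, 8          # toy vocabulary size and hidden width (hypothetical)
SOS, EOS = 0, 1      # start and end-of-sentence token ids (hypothetical)
E = rng.normal(scale=0.5, size=(V, H))     # token embeddings
W_h = rng.normal(scale=0.5, size=(H, H))   # recurrent weights
W_o = rng.normal(scale=0.5, size=(V, H))   # output projection

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def step(h, prev_token, enc):
    """One decoder step: attend, update the state, return next-token probabilities."""
    weights = softmax(enc @ h)             # dot-product attention over encoder vectors
    context = weights @ enc
    h = np.tanh(W_h @ h + E[prev_token] + context)
    return h, softmax(W_o @ h)

def teacher_forced_loss(enc, target):
    """Mean cross-entropy, feeding the ground-truth previous token at every step."""
    h, prev, total = np.zeros(H), SOS, 0.0
    for tok in target:
        h, probs = step(h, prev, enc)
        total -= np.log(probs[tok])
        prev = tok                         # teacher forcing: the model's guess is ignored
    return total / len(target)

def greedy_decode(enc, max_len=20):
    """Inference: feed back the model's own most likely token until EOS or max_len."""
    h, prev, out = np.zeros(H), SOS, []
    for _ in range(max_len):
        h, probs = step(h, prev, enc)
        prev = int(probs.argmax())         # the model's own output becomes the next input
        if prev == EOS:
            break
        out.append(prev)
    return out

enc = rng.normal(size=(10, H))             # stand-in for the listener's output
print(teacher_forced_loss(enc, [3, 4, 2, EOS]))
print(greedy_decode(enc))
```

The two loops differ in a single line: `prev = tok` during training versus `prev = int(probs.argmax())` at inference.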

What it bought, and what it cost

LAS bought a lot: a single clean network trained on audio and text, no forced alignment, and strong context — the decoder can see the whole utterance and everything it has written so far, which is why attention models tend to be highly accurate. But it cost two things. It is not streaming: it needs the entire clip before it can attend, so it cannot transcribe live as you speak — precisely the gap that kept RNN-T relevant for on-device, real-time use. And teacher forcing creates exposure bias: at test time the decoder must rely on its own previous outputs, a situation it never saw in training, so one early mistake can cascade.

Where it sits in the evolution

LAS proved that attention-based sequence-to-sequence worked for speech, and set the template the Transformer would soon build on by throwing out the recurrent networks at the heart of LAS and replacing them with attention all the way down. That is the next stop in the series.

Listen, Attend and Spell was introduced in Chan, Jaitly, Le & Vinyals (2015); the attention mechanism it builds on comes from Bahdanau, Cho & Bengio (2015). Part of our ASR evolution series.