Attention Is All You Need: The Transformer Comes to Speech
Listen, Attend and Spell showed that attention could drive speech recognition, but it still leaned on recurrent networks that process audio one step at a time. In 2017 the Transformer threw recurrence out entirely. Its title was the thesis: Attention Is All You Need. It scaled where RNNs choked, and it is the architecture under Whisper and nearly every modern speech model.
Self-attention, the core idea
A recurrent network passes information along a chain — each step depends on the one before it. Self-attention does away with the chain: every position looks at every other position directly and decides how much to weight each one. Three comparisons make the intuition concrete. A CNN is really a restricted self-attention — it only attends to a small local window. An RNN is sequential, so each output waits on the previous one; self-attention sees the whole sequence at once. And because there is no step-by-step dependency, self-attention is parallelizable — the property that let these models train on far more data than an RNN ever could.
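To make "every position looks at every other position" concrete, here is a minimal single-head sketch in NumPy. It assumes the standard scaled dot-product form (queries, keys and values produced by three projection matrices); the sizes, the random weights and the names self_attention, w_q, w_k and w_v are illustrative choices, not anything taken from a particular speech model.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a (T, d_model) sequence."""
    q = x @ w_q  # (T, d_k) queries
    k = x @ w_k  # (T, d_k) keys
    v = x @ w_v  # (T, d_k) values
    # (T, T) score matrix: every position is compared with every other position.
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
T, d_model, d_k = 5, 8, 4  # hypothetical sizes: 5 frames, 8 features per frame
x = rng.normal(size=(T, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

out, weights = self_attention(x, w_q, w_k, w_v)
print(out.shape, weights.shape)
```

The whole sequence is handled by a few matrix products with no loop over time steps, which is where the parallelism comes from.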
Multi-head attention and positional encoding
The Transformer runs several attention "heads" in parallel, each free to learn a different kind of relationship, then combines them: that is multi-head attention. But self-attention has a quirk: it is order-blind. Looked at as a set, a b c and c b a are identical to it, and order obviously matters for speech. The fix is a positional encoding: a signal (classically a pattern of sines and cosines) added to each input vector so the model knows where in the sequence it is.
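A small sketch can show both the order-blindness and the fix. It assumes the classic sine/cosine encoding and an even d_model. To keep it short, the attend helper is a parameter-free attention that uses the input as queries, keys and values, which is a simplification made for the demo rather than a real layer.

```python
import numpy as np

def sinusoidal_positions(num_positions, d_model):
    """Sine/cosine positional encoding; d_model is assumed to be even."""
    pos = np.arange(num_positions)[:, None]  # (T, 1)
    two_i = np.arange(0, d_model, 2)[None, :]  # (1, d_model / 2)
    angles = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions
    pe[:, 1::2] = np.cos(angles)  # odd dimensions
    return pe

def attend(x):
    # Parameter-free self-attention (queries = keys = values = x), demo only.
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ x

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 8))  # rows stand in for the tokens "a", "b", "c"
reverse = [2, 1, 0]

# Without positions, reversing the input only reverses the output rows.
order_blind = np.allclose(attend(x[reverse]), attend(x)[reverse])

# With positions added, the same token gets a different vector per position.
pe = sinusoidal_positions(3, 8)
order_aware = not np.allclose(attend(x[reverse] + pe), attend(x + pe)[reverse])

print(order_blind, order_aware)
```

Strictly speaking, the first check shows that attention is permutation-equivariant: shuffle the inputs and the outputs shuffle with them, with nothing in the values that records the original order.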
Encoder and decoder
The Transformer is two stacks. Each encoder block is self-attention followed by a position-wise feed-forward network, each wrapped in a residual connection and layer normalization; pass the audio representation through N of these and you get an abstract encoding of the utterance. Each decoder block adds two things: masked self-attention, which can only look at tokens already emitted (it must not peek at the future it is trying to predict), and cross-attention, where the decoder queries the encoder's output to pull in the relevant audio. The decoder generates one token at a time and stops when it emits an end-of-sequence token.
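The two decoder additions can be sketched with one attention function and a mask. This is a simplified illustration: it leaves out the learned projections, residual connections and layer normalization, and the array sizes and the names enc_out and dec_in are made up for the example.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v, mask=None):
    scores = q @ k.T / np.sqrt(k.shape[-1])
    if mask is not None:
        # Blocked positions get a very negative score, so their weight is ~0.
        scores = np.where(mask, scores, -1e9)
    weights = softmax(scores)
    return weights @ v, weights

rng = np.random.default_rng(0)
enc_out = rng.normal(size=(6, 8))  # hypothetical: 6 encoded audio frames
dec_in = rng.normal(size=(4, 8))   # hypothetical: 4 token embeddings so far

# Masked self-attention: position t may look at positions <= t only.
causal = np.tril(np.ones((4, 4), dtype=bool))
_, self_w = attention(dec_in, dec_in, dec_in, mask=causal)

# Cross-attention: queries come from the decoder, keys and values from the encoder.
ctx, cross_w = attention(dec_in, enc_out, enc_out)

print(np.round(self_w, 2))
print(cross_w.shape)
```

The printed self-attention weights are lower-triangular: the first token attends only to itself, the second to the first two, and so on. The cross-attention weights form a tokens-by-frames matrix, one row of audio weights per output token.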
The subtlety: teacher forcing and exposure bias
Like LAS, the Transformer decoder is autoregressive: each token depends on the tokens before it. It is trained with teacher forcing, meaning it is fed the ground-truth previous token rather than its own prediction. That creates the same exposure bias: at inference the model must consume its own outputs, a distribution it never trained on, so an early slip can propagate. Researchers patch it with tricks like scheduled sampling (randomly feeding the model's own prediction during training). There is also a non-autoregressive variant that emits all tokens in parallel; it is much faster but generally less accurate, because it cannot condition on what it just said. The trade-off between speed and context never quite goes away; it just moves around.
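The difference between teacher forcing and scheduled sampling comes down to which token is fed back at each step. The sketch below is a toy under loud assumptions: decoder_inputs, predict_next and the always_wrong stand-in model are hypothetical names, and a real training loop would work on batches of token IDs and usually raise the sampling probability gradually over training.

```python
import random

def decoder_inputs(targets, predict_next, sample_prob, rng, sos="<sos>"):
    """Return the token fed to the decoder at each step.

    sample_prob = 0.0 is pure teacher forcing. Higher values feed back the
    model's own previous prediction more often (scheduled sampling).
    """
    inputs, prev = [], sos
    for target in targets:
        inputs.append(prev)
        predicted = predict_next(inputs)
        use_own = rng.random() < sample_prob
        # Next step sees either the model's guess or the ground truth.
        prev = predicted if use_own else target
    return inputs

def always_wrong(prefix):
    # Hypothetical stand-in for a model that always predicts a wrong token.
    return "<err>"

targets = ["the", "cat", "sat"]
teacher_forced = decoder_inputs(targets, always_wrong, 0.0, random.Random(0))
free_running = decoder_inputs(targets, always_wrong, 1.0, random.Random(0))

print(teacher_forced)  # ['<sos>', 'the', 'cat']
print(free_running)    # ['<sos>', '<err>', '<err>']
```

With teacher forcing the decoder never sees its own mistakes; with the probability set to 1.0 it sees nothing else, which is what inference looks like. Scheduled sampling sits between the two.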
Why it mattered for speech
No recurrence meant training could be parallelized, and parallel training meant scale: far more audio and far bigger models, with attention carrying long-range context that RNNs struggled to hold. Speech-Transformer (2018) brought the architecture to ASR directly, and by 2022 Whisper was a Transformer encoder-decoder trained on roughly 680,000 hours of audio. The Transformer is the reason modern speech recognition is both more accurate and trainable at web scale.
Where it sits in the evolution
The Transformer is the chassis of everything that followed. The last stop before the foundation-model era tunes that chassis specifically for audio — the Conformer.
The Transformer was introduced in Vaswani et al., "Attention Is All You Need" (2017); its first direct ASR adaptation was the Speech-Transformer (Dong et al., 2018). Part of our ASR evolution series.
