
The Conformer: Giving the Transformer an Ear for Audio

ASR evolution series · the Conformer, 2020 · originally 2022, expanded here

The Transformer was built for text, where any word can relate to any other across a sentence. Speech has the same long-range structure, but it is also intensely local: formants, transitions, and the fine texture of a single phoneme all live in neighbouring frames. The Conformer (2020) gives the Transformer that local ear by folding convolution back in, and it became a default backbone for high-accuracy speech recognition.

The gap it fills

Self-attention and convolution have opposite strengths. Attention is built for the global picture — long-range relationships across the whole utterance — but it is not a natural fit for fine local patterns. A CNN is the mirror image: excellent at local detail within a small window, weak at anything far away. Speech recognition genuinely needs both, and earlier architectures only really had one. The Conformer's whole premise is to stop choosing.
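A toy sketch can make the contrast concrete. It assumes scalar frames and no learned weights: `local_average` stands in for a convolution with a three-frame window, and `global_attention` is a bare softmax dot-product attention. Both names are illustrative and not taken from any library.

```python
import math

def local_average(frames, kernel_size=3):
    """Stand-in for a 1-D convolution: each output mixes only nearby frames."""
    half = kernel_size // 2
    out = []
    for t in range(len(frames)):
        window = frames[max(0, t - half): t + half + 1]
        out.append(sum(window) / len(window))
    return out

def global_attention(frames):
    """Toy self-attention on scalar frames: every output mixes all frames."""
    out = []
    for query in frames:
        scores = [query * key for key in frames]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append(sum(w * v for w, v in zip(weights, frames)) / total)
    return out

# One "event" at frame 0, silence everywhere else.
frames = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
local = local_average(frames)
global_ = global_attention(frames)
```

After one layer, the event shows up only at frames 0 and 1 of `local`, while every entry of `global_` is non-zero. Stacking more convolution layers widens the window, but only gradually.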

What changed versus the Transformer

The Conformer keeps the Transformer's encoder shape and makes a few pointed changes:

First, it adds a dedicated convolution module inside every encoder block, placed after self-attention, so each layer captures local acoustic structure alongside the global view that attention provides.

Second, it sandwiches the attention and convolution modules between two feed-forward modules rather than following them with one. This is the "macaron" arrangement, and each feed-forward module is added to the residual stream at half weight.

Third, its attention uses relative positional encoding (the Transformer-XL style) rather than the original absolute positional encoding, which generalises better across utterances of different lengths. (A common misreading is that the Conformer drops positional information because the convolution handles it. It does not; it switches to relative positions.)

For training, SpecAugment supplies the data augmentation by masking blocks of time steps and frequency channels in the input spectrogram.
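The wiring of one block can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the ordering (feed-forward, attention, convolution, feed-forward, then a final layer norm) follows Gulati et al., but the sublayers are passed in as plain callables, the layer norm has no learned scale or bias, and the names `conformer_block`, `add_scaled`, and `layer_norm` are assumptions made for this example.

```python
from typing import Callable, List

Frame = List[float]            # one feature vector
Frames = List[Frame]           # a sequence of frames
Sublayer = Callable[[Frames], Frames]

def add_scaled(x: Frames, y: Frames, scale: float) -> Frames:
    """Residual connection: x + scale * y, element by element."""
    return [[a + scale * b for a, b in zip(xr, yr)] for xr, yr in zip(x, y)]

def layer_norm(x: Frames, eps: float = 1e-5) -> Frames:
    """Normalise each frame to zero mean and unit variance (no learned parameters)."""
    out = []
    for frame in x:
        mean = sum(frame) / len(frame)
        var = sum((v - mean) ** 2 for v in frame) / len(frame)
        out.append([(v - mean) / (var + eps) ** 0.5 for v in frame])
    return out

def conformer_block(x: Frames, ffn1: Sublayer, attention: Sublayer,
                    conv: Sublayer, ffn2: Sublayer) -> Frames:
    x = add_scaled(x, ffn1(x), 0.5)       # first feed-forward, half-step residual
    x = add_scaled(x, attention(x), 1.0)  # self-attention: global context
    x = add_scaled(x, conv(x), 1.0)       # convolution module: local context
    x = add_scaled(x, ffn2(x), 0.5)       # second feed-forward, half-step residual
    return layer_norm(x)

# Usage with placeholder sublayers that return zeros (hypothetical stand-ins).
zeros = lambda x: [[0.0] * len(frame) for frame in x]
frames = [[1.0, 2.0, 3.0], [4.0, 6.0, 8.0]]
out = conformer_block(frames, zeros, zeros, zeros, zeros)
```

In a real encoder each callable would be a trained module, and the attention sublayer would carry the relative positional encoding. The sketch only shows how the residual stream passes through them.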

Why it won

The combination did what neither half could alone: global context from attention, local precision from convolution. That pairing delivered state-of-the-art word error rates on the LibriSpeech benchmark when it was published and made the Conformer encoder a go-to choice for production-grade ASR, the acoustic workhorse a great many systems quietly run on.

Where it sits — and where the road leads

With the Conformer the pieces are all on the table: alignment-free training from CTC, streaming and context from RNN-T, attention from LAS, scale from the Transformer, and an ear for audio from the convolution module. The foundation-model era simply assembles them at scale — Whisper and its peers — and quantization shrinks the result until it runs, offline, on the machine that recorded the audio. That is where this whole arc has been heading, and where it currently rests.

The Conformer was introduced in Gulati et al., "Conformer: Convolution-augmented Transformer for Speech Recognition" (2020). Part of our ASR evolution series — see the field guide for the full arc.