RNN-T: The Model That Put Speech Recognition on Your Phone
CTC made speech recognition end-to-end, but it had a blind spot: it treated every output as independent, with no memory of what it had just said. The RNN-Transducer (RNN-T), introduced by Alex Graves in 2012, closed that gap without giving up the thing that made CTC practical — and in doing so it became the architecture that put real-time, high-quality recognition on phones, offline.
What CTC couldn't do
Recall how CTC works: it emits one token (or a blank) per audio frame, then collapses repeats and drops the blanks. Its great weakness is the conditional-independence assumption: given the audio, each frame's output is predicted without reference to the other outputs, so the model has no built-in sense of language and leans on a separate language model at decode time. It also emits at most one token per input frame, so the output can never be longer than the input; that is fine for speech but a poor fit when the output should be denser than the input.
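The collapse rule described above can be sketched in a few lines of Python. This is a minimal illustration, not any library's implementation: the function name `ctc_collapse` and the use of `∅` as the blank symbol are assumptions made for this example.

```python
from itertools import groupby

BLANK = "∅"  # assumed blank symbol, chosen for this illustration


def ctc_collapse(frame_outputs):
    """Apply the CTC read-back rule to one-symbol-per-frame output.

    Step 1: merge runs of identical consecutive symbols.
    Step 2: drop the blanks.
    """
    merged = [symbol for symbol, _ in groupby(frame_outputs)]
    return "".join(symbol for symbol in merged if symbol != BLANK)


if __name__ == "__main__":
    # The blank between the two l's is what keeps them from merging.
    print(ctc_collapse("hh∅e∅ll∅lo"))
```

Note that the output can never have more tokens than there are frames, which is the length limit mentioned above.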
The bridge: a recurrent aligner
The simplest way to see the fix is through a close relative of CTC. The Recurrent Neural Aligner (RNA), published in 2017, makes one change to CTC: it lets the current output look at the previous output by swapping CTC's per-frame classifier for a recurrent one. The model now has a thread of memory running through its predictions. RNN-T, which actually predates RNA, is the more general form of that idea: it keeps the memory and also lifts the limit of one output per frame.
How RNN-T works
RNN-T is three networks working together:
An encoder (the transcription network) reads the audio frames — the same role CTC's network plays. A prediction network — a small RNN — reads the tokens the model has already emitted; it is, in effect, an internal language model. And a joint network combines the two and decides the next output: a token, or a blank to advance to the next frame.
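As a rough sketch of how the joint network might combine the other two, the snippet below uses one common form: add the encoder and prediction vectors, apply tanh, project to the vocabulary plus blank, and take a softmax. The function names, the weights `w_out` and `b_out`, and the tiny vector sizes are all assumptions for illustration; real systems differ in the details and use a tensor library rather than Python lists.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def joint(enc_t: List[float], pred_u: List[float],
          w_out: List[List[float]], b_out: List[float]) -> List[float]:
    """Toy joint network (assumed form, not a specific published variant).

    enc_t  : encoder output for audio frame t
    pred_u : prediction-network output after u emitted tokens
    Returns a distribution over the vocabulary plus blank.
    """
    hidden = [math.tanh(e + p) for e, p in zip(enc_t, pred_u)]
    logits = [sum(w * h for w, h in zip(row, hidden)) + b
              for row, b in zip(w_out, b_out)]
    return softmax(logits)


if __name__ == "__main__":
    # Two hidden units, three outputs (say: blank, "a", "b"). Values are arbitrary.
    probs = joint([0.5, -1.0], [0.1, 0.3],
                  [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]], [0.0, 0.0, 0.0])
    print(probs)
```

The point of the sketch is the signature: the output depends on both a frame index and a token-history index, which is what gives the model its context.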
Two things fall out of this design. Because every output is conditioned on the tokens emitted so far, RNN-T has the context CTC lacked: it knows what it just said. And because the joint network can fire repeatedly on a single frame before emitting a blank, RNN-T decouples output length from input length: one frame can produce several tokens. Its collapse rule is also simpler than CTC's: drop the blanks, with no merging of repeats. An emission like th∅e∅∅_∅ (where ∅ is the blank and _ is a space) reads back as "the" followed by a space.
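To make the read-back concrete, here is a small sketch that parses an RNN-T emission string, treating each blank as "advance to the next frame". The function name `rnnt_read_back` and the `∅` and `_` symbols are assumptions taken from the example in the text, not part of any library.

```python
from typing import List, Tuple

BLANK = "∅"  # assumed blank symbol, as in the example above


def rnnt_read_back(emission: str, space: str = "_") -> Tuple[str, List[int]]:
    """Read an RNN-T emission back into text.

    Blanks are dropped and repeats are NOT merged. Each blank closes one
    audio frame, so we also record how many tokens each frame produced.
    Tokens after the final blank are kept in the text but not counted.
    """
    text: List[str] = []
    tokens_per_frame: List[int] = []
    count = 0
    for symbol in emission:
        if symbol == BLANK:
            tokens_per_frame.append(count)
            count = 0
        else:
            text.append(" " if symbol == space else symbol)
            count += 1
    return "".join(text), tokens_per_frame


if __name__ == "__main__":
    print(rnnt_read_back("th∅e∅∅_∅"))
```

For the emission in the text, the first frame produces two tokens and the third produces none, which is exactly the decoupling of output length from input length.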
Why it mattered: streaming on the edge
RNN-T is frame-synchronous: as long as the encoder only looks at past audio (or a short lookahead), it can emit text as the audio arrives, without waiting for the end of the utterance. That is exactly the property attention models like Listen, Attend and Spell gave up: they attend over the whole clip, so they need it in hand before they start. Combine streaming with built-in context and you have the natural fit for live, on-device dictation. In 2019 Google moved its on-device speech recognizer to an RNN-T, taking a model that used to live in the data center and shrinking it to something that ran on a phone, offline. That is the same leap that makes laptop-class offline transcription possible.
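The streaming behaviour can be sketched as a greedy decoding loop written as a Python generator: it pulls one frame at a time and yields tokens as soon as they are decided. Everything here is a toy stand-in. `toy_joint` replaces the real encoder, prediction network and joint network with a lookup, the `Frame` tuple is a made-up encoding, and the `max_symbols_per_frame` cap is an assumed safety limit.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

BLANK = "∅"

# Toy "encoder output" (hypothetical): (tokens expected before this frame,
# tokens this frame should emit). A real encoder yields feature vectors.
Frame = Tuple[int, str]


def toy_joint(frame: Frame, history: List[str]) -> str:
    """Stand-in for the joint network: picks the next symbol from the
    current frame and the tokens emitted so far."""
    offset, chars = frame
    k = len(history) - offset
    return chars[k] if 0 <= k < len(chars) else BLANK


def stream_greedy(frames: Iterable[Frame],
                  joint: Callable[[Frame, List[str]], str] = toy_joint,
                  max_symbols_per_frame: int = 5) -> Iterator[str]:
    """Greedy frame-synchronous decoding: emit tokens until the joint
    returns blank, then move on to the next frame."""
    history: List[str] = []
    for frame in frames:
        for _ in range(max_symbols_per_frame):
            token = joint(frame, history)
            if token == BLANK:
                break  # blank means: advance to the next audio frame
            history.append(token)
            yield token  # available to the caller before later frames exist


if __name__ == "__main__":
    frames = [(0, "th"), (2, "e"), (3, ""), (3, " ")]
    print("".join(stream_greedy(frames)))
```

Because the decoder is a generator over an iterable of frames, the first tokens come out after the first frame has been read, without the rest of the audio being available.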
The catch
RNN-T earns its power the hard way. Its training loss sums over every alignment path through a two-dimensional grid (time on one axis, output tokens on the other), and the joint network has to produce a full output distribution at every cell of that grid, for every utterance in the batch. That is memory-hungry, and a large part of the engineering around RNN-T is taming that cost so it fits in GPU memory. But the payoff (context plus streaming plus on-device) was worth it.
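The sum over the grid is usually computed with a forward recursion rather than by enumerating alignments. The sketch below is a plain-Python, log-space version under stated assumptions: `log_blank[t][u]` and `log_label[t][u]` are hypothetical inputs holding the joint network's log-probabilities of blank and of the correct next token at grid cell (t, u). The helper `joint_tensor_bytes` is likewise an illustrative estimate of the logits tensor size, not a measurement.

```python
import math
from typing import List

NEG_INF = -math.inf


def logaddexp(a: float, b: float) -> float:
    """log(exp(a) + exp(b)) without overflow."""
    if a == NEG_INF:
        return b
    if b == NEG_INF:
        return a
    peak = max(a, b)
    return peak + math.log(math.exp(a - peak) + math.exp(b - peak))


def rnnt_neg_log_likelihood(log_blank: List[List[float]],
                            log_label: List[List[float]]) -> float:
    """Forward recursion over the T x (U+1) alignment grid.

    log_blank[t][u]: log P(blank | frame t, u tokens emitted), u = 0..U
    log_label[t][u]: log P(correct token u+1 | frame t, u tokens emitted), u = 0..U-1
    """
    T = len(log_blank)
    U = len(log_blank[0]) - 1
    alpha = [[NEG_INF] * (U + 1) for _ in range(T)]
    alpha[0][0] = 0.0
    for t in range(T):
        for u in range(U + 1):
            if t == 0 and u == 0:
                continue
            # Arrive by emitting blank at (t-1, u): move one frame forward.
            via_blank = alpha[t - 1][u] + log_blank[t - 1][u] if t > 0 else NEG_INF
            # Arrive by emitting a token at (t, u-1): stay on the same frame.
            via_label = alpha[t][u - 1] + log_label[t][u - 1] if u > 0 else NEG_INF
            alpha[t][u] = logaddexp(via_blank, via_label)
    # A final blank on the last frame ends the alignment.
    return -(alpha[T - 1][U] + log_blank[T - 1][U])


def joint_tensor_bytes(batch: int, frames: int, tokens: int, vocab: int,
                       bytes_per_value: int = 4) -> int:
    """Size of the joint output: one distribution per (utterance, t, u) cell."""
    return batch * frames * (tokens + 1) * vocab * bytes_per_value


if __name__ == "__main__":
    half = math.log(0.5)
    T, U = 3, 2
    nll = rnnt_neg_log_likelihood([[half] * (U + 1)] * T, [[half] * U] * T)
    print(math.exp(-nll))
```

With purely illustrative sizes (a batch of 32, 500 frames, 100 target tokens, a 1024-token vocabulary, 4-byte floats), `joint_tensor_bytes` returns about 6.6 GB for the logits alone, before gradients, which gives a feel for why the loss needs careful engineering.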
Where it sits in the evolution
Between CTC (alignment-free but context-free) and the full attention models and Transformers (context-rich but not naturally streaming), RNN-T occupies the practical middle: the model you reach for when recognition has to happen live, on the device. Next in the series, we follow the other branch — the attention line, from Listen, Attend and Spell to the Transformer that reshaped everything.
RNN-T was introduced in Graves, "Sequence Transduction with Recurrent Neural Networks" (2012); its on-device deployment is described in Google's "Streaming End-to-End Speech Recognition for Mobile Devices" (2019). Part of our ASR evolution series.
