wav2vec 2.0: How Speech Recognition Learned From Audio Nobody Labeled
Every model in this series so far needed labeled audio (recordings paired with transcripts), and labeled audio is expensive and, for most of the world's languages, barely exists. The approach that ended that dependency borrowed an idea straight from NLP's BERT: let the model learn the shape of speech from oceans of unlabeled audio first, then teach it to transcribe with only a little labeled data. wav2vec 2.0 (2020) is the landmark.
The labeled-data wall
Classical and end-to-end systems alike were hungry for transcribed audio — thousands of hours of it. For English that is merely costly; for most languages the data simply is not there. So the question that defined this era: could a model learn useful representations of speech without any transcripts at all?
The BERT idea, ported to audio
BERT, in text, hides some of the words in a sentence and makes the model predict them from the surrounding context: self-supervised, no human labels, just the structure of language teaching itself. wav2vec 2.0 does the audio analogue. A convolutional encoder turns the raw waveform into a sequence of latent frames; the model masks spans of those frames; and a Transformer, given the sequence with those spans blanked out, must work out what belonged in the gaps.
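To make "a sequence of latent frames" concrete, here is a small sketch of the frame arithmetic. It assumes the seven-layer encoder configuration reported in the wav2vec 2.0 paper (kernel widths 10, 3, 3, 3, 3, 2, 2 with strides 5, 2, 2, 2, 2, 2, 2, no padding) and 16 kHz input. The function names are ours, for illustration only.

```python
# Kernel widths and strides of the convolutional feature encoder, as reported
# in the wav2vec 2.0 paper (assumed here; other checkpoints may differ).
KERNELS = (10, 3, 3, 3, 3, 2, 2)
STRIDES = (5, 2, 2, 2, 2, 2, 2)
SAMPLE_RATE = 16_000  # Hz


def num_frames(num_samples):
    """Number of latent frames produced for a waveform of num_samples samples."""
    length = num_samples
    for kernel, stride in zip(KERNELS, STRIDES):
        # Output length of an unpadded 1-D convolution.
        length = (length - kernel) // stride + 1
    return length


def frame_geometry():
    """Return (hop, receptive_field) of one latent frame, in input samples."""
    hop, receptive_field = 1, 1
    for kernel, stride in zip(KERNELS, STRIDES):
        receptive_field += (kernel - 1) * hop
        hop *= stride
    return hop, receptive_field


hop, field = frame_geometry()
print(hop * 1000 / SAMPLE_RATE)    # 20.0 (ms between consecutive frames)
print(field * 1000 / SAMPLE_RATE)  # 25.0 (ms of audio seen by each frame)
print(num_frames(SAMPLE_RATE))     # 49 frames for one second of audio
```

Under these assumptions each frame sees 25 ms of audio but a new frame starts every 20 ms, so adjacent frames share 5 ms of input.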
Two details matter. First, why mask a span rather than a single frame? Because each frame covers only about 25 ms of audio and overlaps its neighbours, so the network could "cheat" by interpolating from the frames on either side and learn nothing general. Masking a whole stretch (in the paper, runs of at least ten frames, roughly 200 ms) removes that shortcut and forces the model to rely on longer-range context. Second, because audio is continuous (unlike discrete words), wav2vec 2.0 quantizes the latents with a learned codebook and trains with a contrastive objective: given the Transformer's output at a masked position, pick the true quantized unit out of a set of candidates, where the distractors are drawn from other masked positions in the same utterance. Learn to do that across enough audio and you have learned what speech is made of.
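Both details can be sketched in a few lines of plain Python. This is a simplified illustration, not the reference implementation. The defaults (start probability 0.065, span length 10, temperature 0.1) are the values reported in the paper. Picking span starts with an independent coin flip per frame is our simplification of the paper's sampling. The vectors are ordinary lists, the distractors are supplied by the caller, and the quantizer and the paper's additional codebook-diversity loss are left out entirely.

```python
import math
import random


def sample_span_mask(num_frames, start_prob=0.065, span_len=10, rng=None):
    """Return a list of booleans, True where a latent frame is masked."""
    rng = rng or random.Random()
    mask = [False] * num_frames
    for start in range(num_frames):
        if rng.random() < start_prob:
            # Mask this frame and the span_len - 1 frames after it.
            # Spans are allowed to overlap and are clipped at the end.
            for t in range(start, min(start + span_len, num_frames)):
                mask[t] = True
    return mask


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms


def contrastive_loss(context, target, distractors, temperature=0.1):
    """Negative log-probability of the true target among all candidates.

    context:     Transformer output at one masked position
    target:      the true quantized latent for that position
    distractors: quantized latents taken from other masked positions
    """
    candidates = [target] + list(distractors)
    logits = [cosine(context, q) / temperature for q in candidates]
    peak = max(logits)  # subtract the max for numerical stability
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum - logits[0]


mask = sample_span_mask(200, rng=random.Random(0))
print(sum(mask) / len(mask))  # fraction of frames that ended up masked

# The loss is small when the context vector points at the true target...
print(contrastive_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]]))
# ...and large when it points at a distractor instead.
print(contrastive_loss([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0], [-1.0, 0.0]]))
```

Because spans overlap, the masked fraction is far larger than the start probability suggests; the paper reports that roughly half of all frames end up masked with these settings.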
Pre-train, then fine-tune
The recipe has two stages. First, an upstream model is pre-trained on huge amounts of unlabeled audio, the speech equivalent of how a vision model learns general image features before anyone shows it a specific task. Then a small output layer is added on top and the model is fine-tuned on a modest labeled set, in wav2vec 2.0's case with a CTC objective. The payoff was startling: wav2vec 2.0 reached strong accuracy with as little as ten minutes to ten hours of labeled audio, on the order of a hundred times less than before. HuBERT and WavLM refined the same idea.
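CTC is what lets a per-frame classifier be trained on transcripts that carry no timing information: the model emits one label (or a special blank) per frame, and any frame sequence that collapses to the transcript counts as correct. The sketch below shows only the collapsing rule, applied as the simplest greedy decoder. The blank symbol, the toy vocabulary, and the function names are hypothetical, and real systems often decode with beam search and a language model instead.

```python
from itertools import groupby

BLANK = "_"  # hypothetical symbol standing in for the CTC blank token


def best_label_per_frame(frame_scores, vocab):
    """Pick the highest-scoring vocabulary entry for every frame."""
    return [
        vocab[max(range(len(vocab)), key=row.__getitem__)]
        for row in frame_scores
    ]


def ctc_collapse(frame_labels, blank=BLANK):
    """Merge consecutive repeats, then drop blanks (the CTC collapsing rule)."""
    merged = [label for label, _ in groupby(frame_labels)]
    return "".join(label for label in merged if label != blank)


# Twelve frames of per-frame labels collapse to a five-letter word.
print(ctc_collapse(list("hh_e_ll_lloo")))  # hello
```

Note the blank between the two runs of "l": without it, the repeated letter would be merged into one.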
One representation, many tasks
Just as NLP uses the GLUE benchmark to test one language model across many tasks, speech has SUPERB, which evaluates a single pre-trained speech model on recognition, speaker identification, emotion, and more. The lesson is the important part: a good self-supervised representation is general. The same backbone that powers transcription also feeds speaker and diarization tasks — one model learned from raw audio, reused everywhere.
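One common way to reuse a single backbone, and the setup SUPERB-style evaluations typically use, is to freeze the pre-trained model and give each task a small head that reads a learned weighted sum of the backbone's layers. The sketch below shows only that weighted sum, with nested lists standing in for tensors. The function names and the toy numbers are ours, and in practice the layer weights would be learned per task rather than set by hand.

```python
import math


def softmax(logits):
    """Turn unnormalized scores into weights that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_layer_sum(layer_outputs, layer_logits):
    """Mix the layers of a frozen model into one feature sequence.

    layer_outputs: one entry per layer, each a list of frames,
                   each frame a list of floats (same shape for every layer)
    layer_logits:  one trainable score per layer, owned by the task head
    """
    weights = softmax(layer_logits)
    num_frames = len(layer_outputs[0])
    dim = len(layer_outputs[0][0])
    return [
        [
            sum(w * layer[t][d] for w, layer in zip(weights, layer_outputs))
            for d in range(dim)
        ]
        for t in range(num_frames)
    ]


# Two toy "layers", one frame each. Different tasks can weight them differently.
layers = [[[0.0, 2.0]], [[2.0, 4.0]]]
print(weighted_layer_sum(layers, [0.0, 0.0]))  # equal weights: the average
print(weighted_layer_sum(layers, [5.0, 0.0]))  # leans on the first layer
```

The backbone's outputs are computed once and shared; only the handful of layer scores and the task head differ from task to task.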
Where it sits — and a fork in the road
Self-supervision answered "where does the supervision come from?" with "learn from unlabeled audio, then fine-tune on a little." Two years later, Whisper answered the same question differently — scrape hundreds of thousands of hours of web audio that already came with rough transcripts, and skip fine-tuning entirely. Both routes tore down the labeled-data wall. The next post follows Whisper's, where this whole arc finally lands.
wav2vec 2.0 was introduced in Baevski, Zhou, Mohamed & Auli (2020); its successors include HuBERT (2021) and WavLM. Part of our ASR evolution series.
