iVectors: How Classical ASR Learned to Hear the Speaker
Two people saying the same word produce very different audio; so does one person on two different microphones. Classical speech recognition needed a compact way to capture "who is speaking, and on what channel," so it could adapt to the voice in front of it. The answer, from 2011, was the iVector — a small fingerprint of an utterance that turned out to matter far beyond recognition.
From a universal model to a personal one
Start with a UBM (universal background model): one large Gaussian mixture model trained on a large amount of speech from many people, representing the "average" speaker. Now take a particular utterance, shift the UBM's Gaussian means toward it (MAP adaptation), and stack the adapted means into one enormous vector, the supervector. With C Gaussian components and F-dimensional features, the supervector has C×F entries; for example, 1,024 components and 39-dimensional features give 39,936. It captures the speaker, but it is far too large and redundant to use directly.
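The adaptation-and-stacking step can be sketched in a few lines of NumPy. This is a minimal illustration, assuming a diagonal-covariance UBM and the common relevance-MAP update applied to the means only. The function name map_adapted_supervector, the relevance factor of 16, and the toy dimensions are illustrative choices, not details taken from the original paper.

```python
import numpy as np


def map_adapted_supervector(frames, weights, means, variances, relevance=16.0):
    """Adapt UBM means toward one utterance and stack them (hypothetical helper).

    frames:    (T, F) feature vectors for the utterance
    weights:   (C,)   UBM mixture weights
    means:     (C, F) UBM means
    variances: (C, F) UBM diagonal covariances
    relevance: assumed relevance factor; larger values stay closer to the UBM
    """
    # Log-density of every frame under every Gaussian, shape (T, C).
    diff = frames[:, None, :] - means[None, :, :]
    log_gauss = -0.5 * (
        np.sum(diff ** 2 / variances, axis=2)
        + np.sum(np.log(2.0 * np.pi * variances), axis=1)
    )
    log_post = np.log(weights) + log_gauss

    # Turn into per-frame responsibilities (subtract the max for stability).
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)

    n = post.sum(axis=0)                 # soft frame count per Gaussian, (C,)
    first = post.T @ frames              # weighted feature sums, (C, F)
    data_means = first / np.maximum(n, 1e-10)[:, None]

    # Interpolate between the utterance's means and the UBM's means.
    alpha = (n / (n + relevance))[:, None]
    adapted = alpha * data_means + (1.0 - alpha) * means
    return adapted.reshape(-1)           # supervector with C * F entries


# Toy usage: C = 4 Gaussians, F = 3 feature dimensions (made-up data).
rng = np.random.default_rng(0)
C, F = 4, 3
ubm_weights = np.full(C, 1.0 / C)
ubm_means = rng.normal(size=(C, F))
ubm_vars = np.ones((C, F))
utterance = rng.normal(size=(200, F))

supervector = map_adapted_supervector(utterance, ubm_weights, ubm_means, ubm_vars)
print(supervector.shape)  # (12,)
```

Gaussians that see many frames move most of the way to the utterance's own statistics, while rarely visited Gaussians stay near the UBM. That is the sense in which the adaptation is gentle.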
The trick: factor analysis
The insight is that real speakers and channels do not vary in tens of thousands of independent ways — they vary along a much smaller number of directions. So model the supervector as:
s = m + T w
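Here s is the utterance's supervector, m is the UBM's mean supervector, T is a low-rank "total variability" matrix learned with the EM algorithm, and w is a small latent vector with a standard normal prior. The iVector is the point estimate of w for the utterance; it typically has a few hundred dimensions, and often around 100 in ASR systems.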
One compact vector now summarizes the speaker-and-channel identity of an entire utterance. Because the iVector mixes speaker with channel, speaker recognition systems add a follow-up step, such as an LDA or WCCN projection or a PLDA model, that suppresses the channel variation and leaves a cleaner speaker representation.
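Given a trained T, extracting the iVector for one utterance comes down to a small linear solve. Under the usual assumption of a standard normal prior on w, the posterior mean is w = (I + T^T Σ^-1 N T)^-1 T^T Σ^-1 f, where N holds the per-Gaussian soft frame counts, f holds the first-order statistics centered on the UBM means, and Σ holds the UBM covariances. The sketch below assumes diagonal covariances and a T that already exists (here it is random, purely for illustration); the function name and toy dimensions are hypothetical.

```python
import numpy as np


def extract_ivector(n, f_centered, T, variances):
    """Posterior-mean estimate of w for one utterance (hypothetical helper).

    n:          (C,)      soft frame counts per Gaussian
    f_centered: (C, F)    first-order stats centered on the UBM means
    T:          (C*F, R)  total variability matrix (assumed already trained)
    variances:  (C, F)    UBM diagonal covariances
    """
    C, F = f_centered.shape
    R = T.shape[1]
    inv_var = (1.0 / variances).reshape(-1)   # diagonal of Sigma^-1, (C*F,)
    n_expanded = np.repeat(n, F)              # diagonal of N, (C*F,)

    # Posterior precision: I + T^T Sigma^-1 N T
    precision = np.eye(R) + T.T @ (T * (inv_var * n_expanded)[:, None])
    # Linear term: T^T Sigma^-1 f
    linear = T.T @ (inv_var * f_centered.reshape(-1))
    return np.linalg.solve(precision, linear)


# Toy usage with made-up dimensions: C = 4 Gaussians, F = 3 features, R = 2.
rng = np.random.default_rng(0)
C, F, R = 4, 3, 2
T = rng.normal(size=(C * F, R))
variances = np.ones((C, F))
w_true = np.array([0.5, -1.0])

# Simulate statistics for an utterance whose means sit exactly at m + T w_true.
offsets = (T @ w_true).reshape(C, F)
n = np.full(C, 50.0)
f_centered = n[:, None] * offsets

w_hat = extract_ivector(n, f_centered, T, variances)
print(w_hat)  # approximately w_true; the prior pulls the estimate slightly toward zero
```

The identity term in the precision matrix comes from the prior. With very few frames it dominates and the estimate stays near zero; with many frames the data term dominates and the estimate approaches the value that best explains the utterance.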
Two payoffs
For recognition: append the iVector to the MFCCs of every frame fed into a DNN acoustic model, and the network can condition on the speaker and channel it is hearing. The result is speaker adaptation with no per-speaker retraining. For years, iVectors were a standard input to hybrid DNN-HMM systems for exactly this reason.
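In code, "alongside the MFCCs" usually means plain concatenation: the same iVector is repeated on every frame. A minimal sketch follows, assuming one iVector per utterance and toy dimensions (40-dimensional features, a 100-dimensional iVector); the helper name append_ivector is hypothetical, and real systems may instead estimate the iVector per speaker or update it online.

```python
import numpy as np


def append_ivector(mfcc_frames, ivector):
    """Tile one utterance-level iVector onto every frame (hypothetical helper).

    mfcc_frames: (T, F) acoustic features
    ivector:     (R,)   iVector for the utterance
    returns:     (T, F + R) input for the acoustic model
    """
    tiled = np.tile(ivector, (mfcc_frames.shape[0], 1))
    return np.concatenate([mfcc_frames, tiled], axis=1)


mfcc = np.zeros((300, 40))   # 300 frames of 40-dim features (toy values)
ivec = np.ones(100)          # 100-dim iVector (toy values)
dnn_input = append_ivector(mfcc, ivec)
print(dnn_input.shape)  # (300, 140)
```

The acoustic features change from frame to frame while the appended block stays constant, so the network sees it as a fixed description of the voice and channel for the whole utterance.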
Beyond recognition: the iVector was the backbone of speaker verification and diarization for the better part of a decade — it is how systems answered "is this the same person?" and "who spoke when?" That lineage runs straight into the speaker-aware embeddings behind modern speaker diarization, the feature that labels Speaker 1 versus Speaker 2 in a multi-person recording.
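The "is this the same person?" question can be sketched as a similarity score between two iVectors. Cosine similarity is one simple scoring rule used with iVectors; the version below assumes the vectors have already been channel-compensated, and the threshold of 0.5 is a placeholder that would need tuning on held-out data. The function names are illustrative.

```python
import numpy as np


def cosine_score(w_enroll, w_test):
    """Cosine similarity between an enrollment iVector and a test iVector."""
    denom = np.linalg.norm(w_enroll) * np.linalg.norm(w_test)
    return float(np.dot(w_enroll, w_test) / denom)


def same_speaker(w_enroll, w_test, threshold=0.5):
    """Accept the trial if the score clears a threshold (placeholder value)."""
    return cosine_score(w_enroll, w_test) >= threshold


# Toy vectors standing in for iVectors.
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])    # same direction as a
c = np.array([3.0, 0.0, -1.0])   # orthogonal to a

print(same_speaker(a, b), same_speaker(a, c))  # True False
```

Because the score depends only on direction, differences in vector length do not affect the decision.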
Where it sits in the evolution
The iVector is a perfect snapshot of the era's ingenuity: a hand-built statistical representation doing a job that neural embeddings — x-vectors, and later fully end-to-end speaker models — would soon learn directly from data. The specific technique has largely faded. The problem it framed so cleanly, though — reduce a voice to a compact vector so you can adapt to it and tell speakers apart — is more central to transcription now than it was then.
iVectors were introduced in Dehak, Kenny, Dehak, Dumouchel & Ouellet, "Front-End Factor Analysis for Speaker Verification" (2011). Part of our ASR evolution series — see the field guide for the full arc.
