iVectors: How Classical ASR Learned to Hear the Speaker

ASR evolution series · Era 1, the iVector (2011) · originally 2021, expanded here

Two people saying the same word produce very different audio; so does one person on two different microphones. Classical speech recognition needed a compact way to capture "who is speaking, and on what channel," so it could adapt to the voice in front of it. The answer, from 2011, was the iVector — a small fingerprint of an utterance that turned out to matter far beyond recognition.

From a universal model to a personal one

Start with a UBM (universal background model): one large Gaussian mixture model trained on speech from many people, representing the "average" speaker. Now take a particular utterance, shift the UBM's Gaussian means toward it (MAP adaptation), and stack the adapted means into one enormous vector, the supervector. With C Gaussian components and F-dimensional features, that supervector has C×F numbers; for example, 1,024 components over 39-dimensional features give 39,936 values. It captures the speaker, but it is far too big and redundant to use directly.
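To make the supervector concrete, here is a minimal NumPy sketch of mean-only MAP adaptation followed by stacking. It assumes a diagonal-covariance UBM; the function name map_supervector and the relevance factor of 16 are illustrative choices, not something the original paper or any toolkit prescribes.

```python
import numpy as np

def map_supervector(frames, weights, means, variances, relevance=16.0):
    """Mean-only MAP adaptation of a diagonal-covariance UBM, stacked into a supervector.

    frames: (T, F) features; weights: (C,); means, variances: (C, F).
    relevance is an assumed smoothing constant (hypothetical default).
    """
    # Log-likelihood of every frame under every Gaussian: shape (T, C).
    diff = frames[:, None, :] - means[None, :, :]
    log_gauss = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                        + np.sum(np.log(2 * np.pi * variances), axis=1))
    log_post = np.log(weights) + log_gauss

    # Normalize to per-frame component posteriors (soft assignments).
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)

    n = post.sum(axis=0)                        # (C,) soft frame counts
    f = post.T @ frames                         # (C, F) posterior-weighted sums
    data_mean = f / np.maximum(n, 1e-10)[:, None]

    # Components that saw more data move further from the UBM mean.
    alpha = (n / (n + relevance))[:, None]
    adapted = alpha * data_mean + (1.0 - alpha) * means

    return adapted.reshape(-1)                  # (C * F,) supervector
```

Components that the utterance barely touches stay close to the UBM, which is why the supervector is so redundant: most of its entries carry little utterance-specific information.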

The trick: factor analysis

The insight is that real speakers and channels do not vary in tens of thousands of independent ways — they vary along a much smaller number of directions. So model the supervector as:

s = m + T w

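Here s is the utterance's supervector, m is the UBM's mean supervector, T is a low-rank "total variability" matrix learned with the EM algorithm, and w is the iVector: a small vector, typically around 100 dimensions when used for recognition and a few hundred for speaker verification. Formally, w is a latent variable with a standard normal prior, and the iVector is its posterior mean given the utterance. One compact vector now summarizes the speaker-and-channel identity of an entire utterance. Because the iVector mixes speaker with channel, a follow-up channel-compensation step (an LDA or WCCN projection, or a PLDA model at scoring time) suppresses the channel variation to leave a cleaner speaker representation.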
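Given a trained T, extracting w reduces to one small linear solve. The sketch below computes the standard posterior-mean estimate from per-utterance statistics, again assuming a diagonal-covariance UBM. The function name extract_ivector and the argument layout are illustrative assumptions, and training T itself with EM is not shown.

```python
import numpy as np

def extract_ivector(n, f, ubm_means, ubm_vars, T):
    """Posterior-mean iVector for one utterance (sketch).

    n: (C,) soft frame counts per Gaussian (zeroth-order statistics)
    f: (C, F) posterior-weighted feature sums (first-order statistics)
    ubm_means, ubm_vars: (C, F) diagonal-covariance UBM parameters
    T: (C * F, R) total variability matrix, assumed already trained
    """
    C, F = ubm_means.shape
    R = T.shape[1]

    # Center the first-order statistics around the UBM means.
    f_centered = (f - n[:, None] * ubm_means).reshape(-1)     # (C*F,)

    inv_var = (1.0 / ubm_vars).reshape(-1)                    # diag of Sigma^-1
    n_expanded = np.repeat(n, F)                              # diag of N

    T_scaled = T * inv_var[:, None]                           # Sigma^-1 T

    # Posterior precision: I + T^T N Sigma^-1 T, an (R, R) matrix.
    precision = np.eye(R) + T.T @ (T_scaled * n_expanded[:, None])

    # Posterior mean: precision^-1 T^T Sigma^-1 f_centered.
    rhs = T_scaled.T @ f_centered
    return np.linalg.solve(precision, rhs)
```

The identity term comes from the standard normal prior on w: with very little audio the estimate shrinks toward zero (the "average" speaker), and with more audio the data term dominates.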

Two payoffs

For recognition: append the utterance's iVector to every frame of MFCCs fed into a DNN acoustic model, and the network can condition on whose voice it is modeling. The result is speaker adaptation with no per-speaker retraining. For years, iVectors were a standard input to hybrid DNN-HMM systems for exactly this reason.
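Mechanically, "alongside the MFCCs" usually means concatenation: the same iVector is repeated on every frame. A minimal sketch, with the helper name append_ivector and the example dimensions chosen purely for illustration:

```python
import numpy as np

def append_ivector(mfcc, ivector):
    """Tile one utterance-level iVector across all frames.

    mfcc: (T, F) per-frame features; ivector: (R,).
    Returns a (T, F + R) array suitable as acoustic-model input.
    """
    num_frames = mfcc.shape[0]
    tiled = np.broadcast_to(ivector, (num_frames, ivector.shape[0]))
    return np.concatenate([mfcc, tiled], axis=1)

# Example: 300 frames of 40-dim features plus a 100-dim iVector.
features = append_ivector(np.zeros((300, 40)), np.ones(100))
```

The acoustic features change from frame to frame while the appended block stays constant, so the network sees it as a per-utterance description of the voice and channel.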

Beyond recognition: the iVector was the backbone of speaker verification and diarization for the better part of a decade — it is how systems answered "is this the same person?" and "who spoke when?" That lineage runs straight into the speaker-aware embeddings behind modern speaker diarization, the feature that labels Speaker 1 versus Speaker 2 in a multi-person recording.
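The simplest way to ask "is this the same person?" with iVectors is cosine scoring: compare the direction of the enrollment and test vectors and accept if they are close enough. The sketch below assumes any channel compensation has already been applied; the threshold of 0.5 is a placeholder that would need tuning on held-out trials.

```python
import numpy as np

def cosine_score(w_enroll, w_test):
    """Cosine similarity between two iVectors, in [-1, 1]."""
    denom = np.linalg.norm(w_enroll) * np.linalg.norm(w_test)
    return float(w_enroll @ w_test / denom)

def same_speaker(w_enroll, w_test, threshold=0.5):
    """Accept the trial if the score clears an (assumed) tuned threshold."""
    return cosine_score(w_enroll, w_test) >= threshold
```

Diarization can reuse the same score by extracting one iVector per short segment and clustering the segments whose vectors point the same way.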

Where it sits in the evolution

The iVector is a perfect snapshot of the era's ingenuity: a hand-built statistical representation doing a job that neural embeddings — x-vectors, and later fully end-to-end speaker models — would soon learn directly from data. The specific technique has largely faded. The problem it framed so cleanly, though — reduce a voice to a compact vector so you can adapt to it and tell speakers apart — is more central to transcription now than it was then.

iVectors were introduced in Dehak, Kenny, Dehak, Dumouchel & Ouellet, "Front-End Factor Analysis for Speaker Verification" (2011). Part of our ASR evolution series — see the field guide for the full arc.
