Echo Cancellation: The Quiet Problem Behind Recording Both Sides of a Call
When a transcriber records your microphone and the far side of a call at the same time, it quietly relies on the solution to a problem that occupied signal-processing engineers for decades. Without that solution, every recorded call collapses into a smear of each voice bleeding into the other. The problem is echo, and the idea that solves it is genuinely elegant, and worth understanding if you care why a clean two-sided recording is even possible.
Two kinds of echo
Echo comes in two flavours. Acoustic echo is the one that matters for recording: in a hands-free or conference setup, the far-end voice plays out of your speaker, bounces around the room, and gets picked back up by your microphone — so the other person hears themselves a beat later, mixed into your reply. Line echo is the older cousin, caused by impedance mismatch in the two-to-four-wire hybrids of telephone circuits. The cure is the same in spirit: before you send (or record) the signal, strip the echo back out of it.
Why it looks impossible
Here is the trap. What the microphone captures is a single mixed signal — your speech plus the echo of theirs — and you need to remove just the echo while leaving your voice untouched. Separating two sounds that are already mixed into one recording is like pouring blue ink and red ink into the same bottle and being asked to pull the red back out. From that one signal alone, it cannot be done.
The trick: you have the reference
What rescues it is a second signal you already possess: the original far-end audio, the sound you sent to the speaker in the first place, before the room got hold of it. Call it the reference. The echo is just some transformed version of that reference: echo = F(reference), where F is the echo path, meaning everything the sound did on its way from speaker to microphone (the reflections off walls and ceiling, or the coupling in the line). If you can figure out F, you can predict the echo from the reference and subtract it. The reference is the "rice" without which, as the old phrase goes, even a clever cook can make no rice.
Learning the echo path
You don't know F in advance, because every room is different, so you learn it with an adaptive filter. The filter starts as a guess and continuously adjusts its coefficients to make "reference passed through the filter" match the actual echo as closely as possible, shrinking the leftover error. The classic algorithm for this is LMS (least mean squares), which nudges the coefficients down the slope of the squared error: the very same gradient-descent idea that trains neural networks, applied to a filter's coefficients instead of a network's weights. Once it converges, the filter is in effect a working model of the echo path: push the reference through it, get a faithful copy of the echo, subtract, and your voice is what remains.
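The adaptation loop described above can be sketched in a few lines. This is a minimal, idealised LMS example, not a production canceller: it assumes a four-tap echo path, a white-noise reference, no near-end speech and no background noise, and the step size mu = 0.05 is picked by hand for this toy signal. The function names and all numeric values are illustrative assumptions.

```python
import math
import random

def simulate_echo(reference, impulse_response):
    """Hypothetical room: echo is the reference filtered by a short FIR response."""
    return [sum(h * reference[n - k]
                for k, h in enumerate(impulse_response) if n - k >= 0)
            for n in range(len(reference))]

def lms_cancel(reference, mic, num_taps, mu):
    """Plain LMS: predict the echo from recent reference samples,
    subtract it, and nudge the weights along the error gradient."""
    weights = [0.0] * num_taps
    history = [0.0] * num_taps          # newest reference sample first
    residual = []
    for x, d in zip(reference, mic):
        history = [x] + history[:-1]
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        error = d - echo_estimate       # what is left after cancellation
        weights = [w + mu * error * h for w, h in zip(weights, history)]
        residual.append(error)
    return residual, weights

random.seed(0)
true_path = [0.6, 0.3, -0.2, 0.1]       # assumed echo path, unknown to the filter
reference = [random.uniform(-1, 1) for _ in range(5000)]
mic = simulate_echo(reference, true_path)   # echo only, nobody talking locally

residual, learned = lms_cancel(reference, mic, num_taps=4, mu=0.05)

# Distance between the learned weights and the true echo path.
misalignment = math.sqrt(sum((w - h) ** 2 for w, h in zip(learned, true_path)))
early_energy = sum(e * e for e in residual[:100])
late_energy = sum(e * e for e in residual[-100:])
print(learned, misalignment, early_energy, late_energy)
```

The residual energy in the last hundred samples should be far below that of the first hundred, which is the "shrinking the leftover error" behaviour in miniature. Real rooms need hundreds or thousands of taps, and practical systems often use normalised or frequency-domain variants rather than plain LMS.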
The genuinely hard part
Two requirements pull in opposite directions, and reconciling them is the whole art:
Double-talk. The filter can only learn cleanly when the microphone holds echo only. The moment you start speaking while the echo is also present, your speech, which is unrelated to the reference, corrupts the learning. So the system must detect double-talk and freeze adaptation while both sides speak.
A moving target. The echo path changes the instant you shift in your chair or someone shuts a door, so the filter must be ready to re-adapt quickly. Fast adaptation versus rock-steady stability: you cannot fully have both at once, and every real echo canceller is a careful compromise between them.
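To make the freeze-during-double-talk idea concrete, here is a small sketch that adds a Geigel-style detector to an LMS loop. The Geigel test is one classic, simple detector chosen here for illustration; the text above does not name a specific method. The threshold of 0.5, the hangover of 20 samples, the echo path, and the deliberately loud synthetic near-end talker are all assumptions made so the toy example behaves predictably.

```python
import math
import random

def cancel(reference, mic, num_taps=4, mu=0.05,
           threshold=0.5, hangover=20, freeze=True):
    """LMS echo canceller with an optional Geigel-style double-talk freeze."""
    weights = [0.0] * num_taps
    history = [0.0] * num_taps          # newest reference sample first
    hold = 0                            # samples left to keep adaptation frozen
    cleaned = []
    for x, d in zip(reference, mic):
        history = [x] + history[:-1]
        error = d - sum(w * h for w, h in zip(weights, history))
        cleaned.append(error)
        # Geigel test: the mic is louder than any plausible echo of the
        # recent reference, so someone is probably talking locally.
        if abs(d) > threshold * max(abs(h) for h in history):
            hold = hangover
        if freeze and hold > 0:
            hold -= 1
            continue                    # keep cancelling, but do not adapt
        weights = [w + mu * error * h for w, h in zip(weights, history)]
    return cleaned, weights

def misalignment(weights, path):
    return math.sqrt(sum((w - h) ** 2 for w, h in zip(weights, path)))

random.seed(0)
true_path = [0.25, 0.12, -0.08, 0.04]   # assumed echo path, total gain below 0.5
reference = [random.uniform(-1, 1) for _ in range(4000)]
echo = [sum(h * reference[n - k] for k, h in enumerate(true_path) if n - k >= 0)
        for n in range(len(reference))]

# First 3000 samples: echo only. Last 1000: a loud local talker joins in.
near = [0.0] * 3000 + [random.choice([-1, 1]) * random.uniform(1.1, 1.5)
                       for _ in range(1000)]
mic = [e + v for e, v in zip(echo, near)]

_, frozen_weights = cancel(reference, mic, freeze=True)
_, unfrozen_weights = cancel(reference, mic, freeze=False)

frozen_error = misalignment(frozen_weights, true_path)
unfrozen_error = misalignment(unfrozen_weights, true_path)
print(frozen_error, unfrozen_error)
```

With the freeze enabled, the weights learned during the echo-only stretch survive the double-talk stretch; without it, the local speech drags them away from the true path. The same freeze is also what slows re-adaptation when the room changes, which is the tension described above.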
From DSP to neural — and why it reaches your transcript
For decades this lived in classical signal processing, in the echo canceller inside every speakerphone. WebRTC later made a strong implementation open and ubiquitous (its audio engine grew out of GIPS, Global IP Solutions, a company whose whole reputation was built on echo cancellation and packet-loss concealment), and neural echo cancellation is now pushing quality further still. The reason any of it matters for transcription is concrete: clean, echo-free channels mean the speech model hears each speaker distinctly rather than a smeared overlap, which is what makes recording both sides of a conversation something you can actually transcribe, and diarize, accurately.
Echo cancellation is a textbook application of adaptive signal processing; the LMS algorithm dates to Widrow & Hoff (1960). Modern open implementations include WebRTC's AEC, and neural approaches now extend it. Part of our notes on audio and recording.
