Gradient Descent: How Machines Actually Learn
Strip away the jargon and almost every model in modern AI — from a line through some house prices to Whisper — is trained by the same humble procedure: make a guess, measure how wrong it is, and nudge it slightly less wrong. Repeat a few million times. That procedure is gradient descent, and it is worth understanding once, clearly, because everything else is built on it.
A model is just a function with knobs
Take the classic example: predict a house's price from its features — size, location, number of
rooms. You assume some function h(x) that turns features x into a predicted
price. For a linear model, h(x) = θ₀ + θ₁x₁ + θ₂x₂ + …
— the θ values are the knobs (parameters), and each one says how much its feature
matters: is floor area more important than the neighbourhood, or less? Learning means finding the knob
settings that make the predictions good.
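To make "a function with knobs" concrete, here is a minimal sketch of the linear h(x) in plain Python. The function name predict, the two features chosen, and the knob values are all made up for illustration; nothing here is fitted to real data.

```python
def predict(theta, x):
    """Linear model: h(x) = theta[0] + theta[1]*x[0] + theta[2]*x[1] + ..."""
    return theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))


# Hypothetical knob settings: base price, price per square metre, price per room.
theta = [50_000.0, 2_000.0, 10_000.0]

# Hypothetical house: 80 square metres, 3 rooms.
house = [80.0, 3.0]

price = predict(theta, house)
print(price)  # 50_000 + 2_000*80 + 10_000*3 = 240000.0
```

Changing any entry of theta changes every prediction, which is exactly what "turning a knob" means here.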
Measuring "wrong": the loss function
To improve the knobs you first need to score them. That score is the loss function
J(θ) — how far the model's predictions sit from the truth across your training
data. A standard choice for regression is the average squared error: for each example, take
(prediction − actual), square it (so misses in either direction count, and big misses
count a lot), then average over all examples. Low J means good knobs; the whole job is to find the θ
that makes J as small as possible.
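As a rough sketch, the average squared error described above fits in a few lines. The names loss, xs and ys are illustrative, and the three data points are a toy set generated from y = 1 + 2x, not real house prices.

```python
def predict(theta, x):
    """Linear model: theta[0] is the intercept, the rest weight each feature."""
    return theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))


def loss(theta, xs, ys):
    """Mean squared error J(theta) over the training examples."""
    errors = [predict(theta, x) - y for x, y in zip(xs, ys)]
    return sum(e * e for e in errors) / len(errors)


# Toy training data generated from y = 1 + 2x (an assumption for illustration).
xs = [[1.0], [2.0], [3.0]]
ys = [3.0, 5.0, 7.0]

good = loss([1.0, 2.0], xs, ys)  # the knobs that generated the data: zero error
bad = loss([0.0, 0.0], xs, ys)   # always predicts 0: (9 + 25 + 49) / 3
print(good, bad)
```

The knobs that generated the data score exactly zero, and anything else scores higher, which is the property the descent relies on.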
The descent
For a simple linear model you can solve for that minimum directly with algebra (least squares). But the
closed-form solution gets expensive as the number of features grows, and most models have no closed form
at all, so we use something more general and more humble. Picture
J(θ) as a landscape of hills and valleys, with the knob settings as your position and
the height as your error. You want the lowest valley. Gradient descent is the strategy of someone
walking downhill in fog:
1. Start somewhere (random knob settings).
2. Feel which way is steepest downhill — the negative gradient.
3. Take a small step that way.
4. Repeat until you can't go any lower.
The size of each step is the learning rate (α). Too large and you
bound across the valley and overshoot; too small and you inch along forever. The gradient itself is just
the slope of J with respect to each knob — computed, in a neural network, by
backpropagation — and it points in the direction that increases error fastest, so you step the
opposite way.
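The four steps and the role of α can be sketched as follows, for the linear model with mean squared error. This is a minimal illustration under assumptions, not a production optimizer: the toy data, the function names, the fixed all-zero start (used instead of a random one so the run is reproducible) and the two α values are all chosen for the example.

```python
def predict(theta, x):
    return theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))


def gradient(theta, xs, ys):
    """Slope of the mean squared error with respect to each knob."""
    n = len(xs)
    grad = [0.0] * len(theta)
    for x, y in zip(xs, ys):
        err = predict(theta, x) - y
        grad[0] += 2 * err / n            # intercept knob
        for j, xj in enumerate(x):
            grad[j + 1] += 2 * err * xj / n
    return grad


def gradient_descent(xs, ys, alpha=0.05, steps=2000):
    theta = [0.0] * (len(xs[0]) + 1)      # step 1: start somewhere
    for _ in range(steps):                # step 4: repeat
        grad = gradient(theta, xs, ys)    # step 2: find the uphill direction
        theta = [t - alpha * g for t, g in zip(theta, grad)]  # step 3: step the other way
    return theta


# Toy data generated from y = 1 + 2x (an assumption for illustration).
xs = [[1.0], [2.0], [3.0]]
ys = [3.0, 5.0, 7.0]

theta = gradient_descent(xs, ys)                        # settles near [1, 2]
diverged = gradient_descent(xs, ys, alpha=0.2, steps=50)  # alpha too large for this data
print(theta, diverged)
```

On this particular data, α = 0.05 walks down to roughly θ₀ = 1, θ₁ = 2, while α = 0.2 overshoots further on every step and the knobs blow up instead of settling. A fixed step count stands in for "until you can't go any lower"; real training loops usually stop on a loss or gradient threshold instead.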
The catch: which valley?
Downhill does not guarantee the lowest point. You can settle into a local valley that is not the global one — and where you end up can depend on where you started. For a linear model the landscape is a single bowl, so there is only one minimum and you are safe. For deep networks the landscape is far more rugged, with countless local minima — and yet, in practice, descent with good initialization and a few decades of accumulated tricks (momentum, Adam, learning-rate schedules) reliably finds settings that work. Why it works so well on such ugly landscapes is still partly a research question.
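A one-knob toy landscape shows the dependence on the starting point. The function f below is invented for illustration (it is not a real model's loss); it has two valleys, a deeper one near x = -1.30 and a shallower one near x = 1.13.

```python
def f(x):
    """A made-up bumpy 'loss' with two valleys."""
    return x**4 - 3 * x**2 + x


def df(x):
    """Derivative of f: the slope at x."""
    return 4 * x**3 - 6 * x + 1


def descend(x, alpha=0.01, steps=1000):
    """Plain gradient descent on f from starting point x."""
    for _ in range(steps):
        x = x - alpha * df(x)
    return x


left = descend(-2.0)   # starts on the left slope
right = descend(2.0)   # starts on the right slope
print(left, f(left))
print(right, f(right))
```

Both runs stop where the slope is zero, but the run started at 2.0 ends in the shallower valley and never learns that a lower one exists on the other side of the hill.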
Why it's the whole ballgame
What changed between this 2017-era house-price example and today's foundation models is not the idea; it is the size. The same loop trains a 3-parameter linear model and a multi-billion-parameter Transformer; the gradient is just computed across far more knobs, on far more data, with far more compute. Understand "walk downhill on the loss" and you understand the engine under every model in the rest of this blog.
Written originally as part of a 2017 machine-learning primer. Gradient descent itself is much older (Cauchy described it in 1847); what changed is the scale it now runs at. Part of our ML foundations notes.
