Gradient Descent: How Machines Actually Learn

ML foundations · originally 2017, expanded here

Strip away the jargon and almost every model in modern AI — from a line through some house prices to Whisper — is trained by the same humble procedure: make a guess, measure how wrong it is, and nudge it slightly less wrong. Repeat a few million times. That procedure is gradient descent, and it is worth understanding once, clearly, because everything else is built on it.

A model is just a function with knobs

Take the classic example: predict a house's price from its features — size, location, number of rooms. You assume some function h(x) that turns features x into a predicted price. For a linear model, h(x) = θ₀ + θ₁x₁ + θ₂x₂ + … — the θ values are the knobs (parameters), and each one says how much its feature matters: is floor area more important than the neighbourhood, or less? Learning means finding the knob settings that make the predictions good.
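To make the "function with knobs" idea concrete, here is a minimal sketch of the linear model in Python. The function name `predict`, the two features chosen, and every number are made up for illustration; nothing here comes from a real dataset.

```python
def predict(theta, x):
    """Linear model h(x) = theta[0] + theta[1]*x[0] + theta[2]*x[1] + ..."""
    prediction = theta[0]  # theta[0] is the intercept (the base price)
    for weight, feature in zip(theta[1:], x):
        prediction += weight * feature  # each knob scales its own feature
    return prediction


# Hypothetical knob settings: base price, price per square metre, price per room.
theta = [50_000.0, 2_000.0, 10_000.0]
house = [120.0, 3.0]  # 120 square metres, 3 rooms

price = predict(theta, house)
print(price)
```

Changing any entry of `theta` changes every prediction, which is exactly what learning will exploit.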

Measuring "wrong": the loss function

To improve the knobs you first need to score them. That score is the loss function J(θ): a measure of how far the model's predictions sit from the truth across your training data. A standard choice for regression is the mean squared error: for each example, take (prediction − actual), square it (so misses in either direction count, and big misses count a lot), then average over all the examples. Low J means good knobs; the whole job is to find the θ that makes J as small as possible.
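As a rough sketch, the mean squared error can be written in a few lines of Python. The helper names (`predict`, `loss`) and the three training examples are assumptions for illustration; the data is constructed so that the price follows 1 + 2x exactly.

```python
def predict(theta, x):
    """Linear model: intercept plus a weighted sum of the features."""
    return theta[0] + sum(w * f for w, f in zip(theta[1:], x))


def loss(theta, examples):
    """Mean squared error J(theta) over (features, actual) pairs."""
    total = 0.0
    for x, actual in examples:
        error = predict(theta, x) - actual
        total += error ** 2  # squaring makes misses in either direction count
    return total / len(examples)


# Made-up training data with one feature, generated from y = 1 + 2x.
examples = [([1.0], 3.0), ([2.0], 5.0), ([3.0], 7.0)]

good = loss([1.0, 2.0], examples)  # knobs that match the data
bad = loss([0.0, 0.0], examples)   # knobs that always predict zero
print(good, bad)
```

The knobs that generated the data score a loss of zero, while the all-zero knobs score much worse; training is the search for the first kind.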

The descent

For a simple linear model you can solve for that minimum directly with algebra (least squares). But the closed-form solution gets expensive as the number of features grows, and most models have no closed form at all, so we use something more general. Picture J(θ) as a landscape of hills and valleys, with the knob settings as your position and the height as your error. You want the lowest valley. Gradient descent is the strategy of someone walking downhill in fog:

1. Start somewhere (random knob settings).
2. Feel which way is steepest downhill — the negative gradient.
3. Take a small step that way.
4. Repeat until you can't go any lower.

The size of each step is the learning rate (α). Too large and you bound across the valley and overshoot; too small and you inch along forever. The gradient itself is just the slope of J with respect to each knob — computed, in a neural network, by backpropagation — and it points in the direction that increases error fastest, so you step the opposite way.
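The four steps and the role of the learning rate can be sketched in plain Python for a one-feature linear model. This is an illustrative toy, not a production optimizer: the function names, the data, the fixed starting point (used instead of a random one so the run is reproducible), and both values of `alpha` are assumptions chosen to make the behaviour visible.

```python
def gradient(theta, examples):
    """Gradient of the mean squared error for h(x) = theta[0] + theta[1] * x."""
    g0 = g1 = 0.0
    for x, y in examples:
        error = (theta[0] + theta[1] * x) - y
        g0 += 2 * error      # slope of J with respect to theta[0]
        g1 += 2 * error * x  # slope of J with respect to theta[1]
    n = len(examples)
    return [g0 / n, g1 / n]


def gradient_descent(examples, alpha, steps):
    theta = [0.0, 0.0]  # step 1: start somewhere
    for _ in range(steps):  # step 4: repeat
        g = gradient(theta, examples)  # step 2: find the uphill direction
        # step 3: take a small step the opposite way, scaled by alpha
        theta = [t - alpha * gi for t, gi in zip(theta, g)]
    return theta


# Made-up data generated from y = 1 + 2x.
examples = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

theta = gradient_descent(examples, alpha=0.05, steps=5000)
diverged = gradient_descent(examples, alpha=0.5, steps=20)
print(theta)     # close to [1.0, 2.0]
print(diverged)  # very large values: each step overshoots further than the last
```

With the smaller learning rate the knobs settle near the values that generated the data; with the larger one, every step lands further up the opposite wall of the valley and the error grows instead of shrinking.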

The catch: which valley?

Downhill does not guarantee the lowest point. You can settle into a local valley that is not the global one — and where you end up can depend on where you started. For a linear model the landscape is a single bowl, so there is only one minimum and you are safe. For deep networks the landscape is far more rugged, with countless local minima — and yet, in practice, descent with good initialization and a few decades of accumulated tricks (momentum, Adam, learning-rate schedules) reliably finds settings that work. Why it works so well on such ugly landscapes is still partly a research question.
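The dependence on the starting point is easy to see on a one-dimensional toy landscape. The function below, f(x) = x⁴ − 3x² + x, is an arbitrary choice made only because it has two valleys of different depth; it is not the loss of any real model.

```python
def f(x):
    """A toy landscape with two valleys of different depth."""
    return x ** 4 - 3 * x ** 2 + x


def f_prime(x):
    """Derivative (slope) of f."""
    return 4 * x ** 3 - 6 * x + 1


def descend(start, alpha=0.01, steps=1000):
    x = start
    for _ in range(steps):
        x -= alpha * f_prime(x)  # step against the slope
    return x


left = descend(-2.0)   # settles in the deeper valley, near x = -1.30
right = descend(2.0)   # settles in the shallower valley, near x = 1.13
print(left, f(left))
print(right, f(right))
```

Both runs follow the same rule and both stop where the slope is zero, yet only the run that started on the left reaches the lower of the two valleys.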

Why it's the whole ballgame

What changed between this 2017-era house-price example and today's foundation models is not the idea; it is the size. The same loop trains a 3-parameter linear model and a multi-billion-parameter Transformer; the gradient is just computed across far more knobs, on far more data, with far more compute. Understand "walk downhill on the loss" and you understand the engine under every model in the rest of this blog.

Written originally as part of a 2017 machine-learning primer. Gradient descent itself is much older (Cauchy described it in 1847); what changed is the scale it now runs at. Part of our ML foundations notes.

