
Running AI Models Fully On-Device in 2026

On-device AI · 2026 · revisiting a survey we first wrote in 2018

Back in 2018 we surveyed the state of "mobile deep learning," and the framing that survey used has aged better than any of the frameworks in it. It split the world in two: the online way — ship your data to a server to run the model — and the offline way: run the model on the device, where the data already is. Offline meant privacy and no network dependency. In 2018 it also meant tiny models. In 2026 it does not, and that single change is the whole story.

What "on-device" meant in 2018

The frameworks of the moment were Caffe2 (Facebook), TensorFlow Lite (Google), and Core ML (Apple), with newcomers like Baidu's MDL alongside. They shared a model: train in the data centre, then export a slimmed-down model to run predictions on a phone. The bet was explicitly about privacy — keeping user data on the device — but the reality was modest: small vision and audio models, heavily trimmed, doing narrow jobs. Anything ambitious still phoned home.

What changed

Three things moved at once, and together they moved the frontier from "toy models" to "real ones."

Runtimes matured and converged. ONNX Runtime became a near-universal way to run a model anywhere; llama.cpp / GGUF made large language models run on plain CPUs and consumer GPUs; whisper.cpp did the same for speech; Core ML, LiteRT (the former TensorFlow Lite), ExecuTorch, and MLC-LLM rounded out the field. Exporting a model to the edge stopped being a research project.

Hardware caught up. Neural accelerators (NPUs) are now standard silicon: Apple's Neural Engine, Qualcomm's Hexagon, and the NPUs in mainstream "AI PCs." Where there's a GPU, offload via CUDA, Metal, DirectML, or Vulkan puts real compute behind local inference on every major platform.
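In application code, that hardware choice usually reduces to an ordered list of backends with a CPU fallback. The sketch below assumes ONNX Runtime's execution-provider names. The helper pick_providers and its preference order are hypothetical, written for illustration. In a real program the available list would typically come from onnxruntime.get_available_providers(), and the result would be passed as the providers argument when creating an inference session.

```python
# Preference order is an assumption for illustration: try accelerator
# backends first and always keep the CPU provider as the final fallback.
PREFERRED = [
    "CUDAExecutionProvider",    # NVIDIA GPUs
    "CoreMLExecutionProvider",  # Apple hardware via Core ML
    "DmlExecutionProvider",     # DirectML on Windows
    "CPUExecutionProvider",     # fallback
]


def pick_providers(available, preferred=PREFERRED):
    """Return the preferred providers that are actually available, in order.

    Hypothetical helper: `available` is a list of provider-name strings.
    The CPU provider is appended if it is missing, so the returned list
    is never empty.
    """
    chosen = [name for name in preferred if name in available]
    if "CPUExecutionProvider" not in chosen:
        chosen.append("CPUExecutionProvider")
    return chosen


if __name__ == "__main__":
    # Simulated machine with an NVIDIA GPU; no runtime is imported here.
    print(pick_providers(["CPUExecutionProvider", "CUDAExecutionProvider"]))
```

Keeping the selection in a pure function means the same binary can ship to machines with and without an accelerator, and the fallback path can be tested without the hardware.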

Models got smaller without getting dumber. Quantization (8-bit, 4-bit, and the GGUF k-quant family) and distillation shrank models by up to an order of magnitude while keeping most of their quality, and low-rank adapters let fine-tuned variants ship as small deltas rather than full copies of the weights. A capable model that needed a server in 2020 fits in a laptop's memory in 2026.
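To make the quantization idea concrete, here is a minimal sketch of symmetric per-tensor 8-bit quantization in plain Python. It is a teaching example under simplifying assumptions, not the GGUF k-quant format, which quantizes in blocks and stores extra per-block scales. The function names are made up for this illustration.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    # Guard against an all-zero tensor, where any scale works.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers and scale."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    weights = [0.5, -1.27, 0.0, 1.0]
    q, scale = quantize_int8(weights)
    restored = dequantize(q, scale)
    # Each restored weight lies within half a quantization step of the original.
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print(q, scale, worst)
```

Each weight now takes one byte instead of four (for 32-bit floats), at the cost of a rounding error bounded by half the scale. Production schemes shrink that error further by using one scale per small block of weights rather than one per tensor.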

What runs locally now

The list is no longer "small classifiers." Quantized language models in the billions of parameters, speech models like Whisper-large, image generation, and embedding and retrieval for local search all run on a normal laptop, and a surprising amount runs on a phone. The offline half of that 2018 split quietly absorbed most of what used to require the online half.
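A rough back-of-the-envelope check shows why billions of parameters now fit on a laptop. The sketch below estimates weight storage as parameters times bits per weight. The 1.2 overhead factor for activations and caches is an assumption chosen for illustration, and real quantized formats store scales alongside the weights, so their effective bits per weight are somewhat higher than the nominal figure.

```python
GIB = 2**30  # bytes in one gibibyte


def weight_footprint_gib(n_params, bits_per_weight):
    """Approximate size of the weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / GIB


def fits_in_memory(n_params, bits_per_weight, ram_gib, overhead=1.2):
    """Rough fit check; `overhead` is an assumed multiplier, not a measured one."""
    return weight_footprint_gib(n_params, bits_per_weight) * overhead <= ram_gib


if __name__ == "__main__":
    n_params = 7_000_000_000  # a hypothetical 7-billion-parameter model
    for bits in (16, 8, 4):
        size = weight_footprint_gib(n_params, bits)
        verdict = fits_in_memory(n_params, bits, ram_gib=8)
        print(f"{bits:>2}-bit: {size:5.2f} GiB, fits in 8 GiB: {verdict}")
```

Under these assumptions a 7-billion-parameter model needs about 13 GiB at 16 bits and a little over 3 GiB at 4 bits, which is the difference between a server-class machine and an ordinary laptop.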

Why the privacy bet finally pays off

In 2018, choosing on-device meant accepting a weaker model in exchange for privacy. That trade-off is what has dissolved. You no longer give up much capability to keep your data on your machine, which means the privacy-first architecture is now also, often, the better architecture — faster (no round trip), more reliable (works on a plane), and compliant by construction (nothing to leak because nothing leaves). This is the same arc the Whisper piece traces for speech specifically: the data-centre model became a download.

It is also, plainly, the bet our products are built on. The 360Converter Offline Transcriber runs speech recognition entirely on your machine for exactly this reason — not as a limitation, but because in 2026 there is no longer a good reason to send sensitive audio anywhere else.

Where it's heading

The pressure runs one direction: smaller, faster, more local. NPUs keep getting faster, quantization keeps getting better, and the default for a new AI feature is shifting from "call an API" to "run it here." The 2018 survey guessed that offline would matter for privacy. It was right; it just took the models a few years to make the privacy free.

This is a 2026 rewrite of a survey we originally published in 2018, when Caffe2, TensorFlow Lite, and Core ML were the state of the art. Framework names and capabilities change quickly; the on-device direction has not.