You can tell when a model is about to fail

Paper 2E · Pødenphant Lund (2026) · Read on Zenodo

You can see the model wobble before it breaks.

When you train a language model a little too long on a narrow task, it can quietly lose the ability to reason while its test scores still look perfectly fine. The usual warning lights stay green right up to the crash. But the model itself shows its hand first: if you watch how torn it is between possible next words, you can see it start to wobble several steps before its answers ever get worse. That wobble can be turned into an early-warning system.

The score lies to you

When people check how good a model is, they look at the output: did it get the answer right, what is the test score, what is the benchmark number. That number is honest but slow. It only changes after something has already gone wrong inside the model, and it can be fooled. If the test questions are too easy, or the model is already at the top of the scale, the model can change a lot on the inside while the score on the outside does not budge at all.

This is a real headache when you fine-tune a model: you take a general model and train it on your own narrow task. You want to stop at the sweet spot: after the model has learned your task, but before it has learned it so hard that it forgets how to do everything else. Train past that point and the model "over-fits": it nails your narrow task while its broader reasoning crumbles. And the worst part is that the usual signals, the training loss and the test score on your task, keep looking good through the whole disaster, because the model really is getting better and better at your one narrow thing.

Read the friction, not the score

Every time a model writes a word, it is choosing between candidates. Sometimes one word is the obvious winner and the rest are nowhere close. Sometimes several words are neck-and-neck and the model is genuinely torn. That tension between competing options is what friction theory calls friction, and you can read it straight off the model's own internal numbers without asking it anything extra. It is essentially free.
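To make "torn between candidates" concrete, here is a minimal sketch of how such a reading could be taken from a model's next-word scores. The exact definition of friction used in the paper is not given in this summary, so both measures below (normalized entropy and the gap between the top two candidates) are stand-in assumptions, and the function names are made up for illustration.

```python
import math

def softmax(logits):
    """Turn raw next-word scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def friction_proxy(logits):
    """ASSUMED proxy, not the paper's definition: normalized entropy.

    Returns about 0.0 when one word is the obvious winner and 1.0 when
    every candidate is equally likely. Needs at least two candidates.
    """
    probs = softmax(logits)
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return entropy / math.log(len(probs))

def top2_margin(logits):
    """ASSUMED alternative proxy: probability gap between the two
    leading candidates. A small gap means the model is torn."""
    probs = sorted(softmax(logits), reverse=True)
    return probs[0] - probs[1]
```

With four equally scored candidates, `friction_proxy` comes out at its maximum and `top2_margin` at zero; with one runaway winner, the proxy falls close to zero. Both are computed from numbers the model already produces at each step, which is why the reading costs almost nothing.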

Crucially, friction is a property of what is going on inside the model, not of the word it finally picked. So it can tell you about the model's internal state even when the final answers still look fine. It carries information the score does not.

Spotting the crash before it happens

The team fine-tuned several small models on a deliberately narrow task (answer arithmetic with no working-out shown) and watched both the test score and the friction the whole way through. What they found has a familiar shape from other parts of science: an over-fitting collapse behaves like a tipping point, and tipping points announce themselves in advance.

Just before a system tips over, it starts to "slow down": its little fluctuations get bigger and last longer, like a stressed system that takes longer to settle after each nudge. Tipping-point researchers call this critical slowing down, and ecologists and climate scientists look for it when judging whether a lake or a climate system is about to flip. The same fingerprint shows up here. In the steps before the model's reasoning collapsed, its friction got noticeably more jittery, and crucially, this happened while the test score was still flat and looking healthy. Then the model tipped: it locked onto the narrow template, the friction at the first word dropped to zero, and its reasoning died.
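"Bigger" and "last longer" correspond to two standard tipping-point indicators: the variance of a signal and its lag-1 autocorrelation (how much each value resembles the one before it). The sketch below computes both over a sliding window of friction readings. It illustrates the general technique only; the window size and the helper names are assumptions, not details taken from the paper.

```python
from statistics import fmean

def variance(xs):
    """Population variance: how big the fluctuations are."""
    mu = fmean(xs)
    return fmean([(x - mu) ** 2 for x in xs])

def lag1_autocorr(xs):
    """Lag-1 autocorrelation: how long the fluctuations last.

    Values near +1 mean each reading closely follows the previous one,
    i.e. the system is slow to settle. Returns 0.0 for a constant series.
    """
    mu = fmean(xs)
    denom = sum((x - mu) ** 2 for x in xs)
    if denom == 0.0:
        return 0.0
    num = sum((a - mu) * (b - mu) for a, b in zip(xs, xs[1:]))
    return num / denom

def rolling(xs, window, stat):
    """Apply `stat` to every full window of `window` consecutive readings."""
    return [stat(xs[i - window:i]) for i in range(window, len(xs) + 1)]
```

Calling `rolling(friction_history, 10, variance)` and `rolling(friction_history, 10, lag1_autocorr)` on a list of per-step friction readings gives two curves. A sustained rise in both while the test score stays flat is the kind of pattern described above.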

So the friction starts shouting several steps before the score even whispers. You can watch the model become torn about how to begin its answer (the early warning), and only later does it stop reasoning altogether (the crash). The warning comes first. And this fingerprint showed up in three different model families, not just one, so it is not a fluke of a single model.

Turning it into a warning light that calibrates itself

An early warning is only useful if it works on a fresh model you have never seen, without you hand-tuning it first. The clever bit here is that the warning calibrates itself. Instead of a fixed number that has to be re-tuned for every model, the monitor watches each run's friction against that same run's own calm starting point, and raises a flag when the jitter climbs above a set multiple of it. Because the comparison is always against the model's own baseline, it needs no separate setup pass and no magic number ported from one model to the next.
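A self-calibrating flag of this kind can be sketched in a few lines. The version below measures jitter as the standard deviation of recent friction readings and compares it with the standard deviation over the run's own first steps. The number of baseline steps, the window size and the multiple of 3 are placeholder assumptions, not the values used in the paper.

```python
from statistics import pstdev

def first_alarm_step(friction, baseline_steps=20, window=10, multiple=3.0):
    """Return the first step whose recent jitter exceeds `multiple` times
    the run's own baseline jitter, or None if no step does.

    friction       : list of per-step friction readings for ONE run
    baseline_steps : ASSUMED length of the calm starting stretch
    window         : ASSUMED number of recent steps used to measure jitter
    multiple       : ASSUMED alarm threshold relative to the baseline
    """
    if len(friction) < baseline_steps + window:
        return None  # not enough data to calibrate and monitor
    baseline = pstdev(friction[:baseline_steps])
    if baseline == 0.0:
        return None  # a perfectly flat start gives nothing to compare against
    # Only windows that lie entirely after the baseline stretch are checked.
    for end in range(baseline_steps + window, len(friction) + 1):
        jitter = pstdev(friction[end - window:end])
        if jitter > multiple * baseline:
            return end - 1  # index of the step that tripped the flag
    return None
```

Because the threshold is expressed as a multiple of each run's own starting jitter, the same function can be pointed at a different model without changing any argument.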

The flag does not slam on the brakes; it just turns on closer monitoring and says "a collapse looks imminent". That makes it safe to be a bit sensitive: a false alarm costs you a little extra checking, not a wrongly cancelled training run. Run across four models at once, the monitor caught the collapse in all three that actually collapsed, and a built-in check correctly skipped the fourth, where the test was too hard to be informative in the first place.

Why the normal warning light fails

You might ask: doesn't the usual over-fitting check already cover this? It watches the model's score on a held-out slice of the training task. But that is exactly the wrong place to look here. The model is not over-fitting to the training task. It is getting better at that. What it is losing is a different ability, reasoning, that the training task never measured. So the in-task score keeps rising straight through the collapse and never warns you. The friction, watching the model's actual reasoning, is the one signal that moves with the failure.

Knowing when an instruction will help

The same friction reading, used a different way, answers a separate practical question: when is it worth giving a model extra instructions or a smarter prompt? Read friction across models of different sizes and they sort into two camps. Some are "pressed": full of internal tension, and a good instruction can tip them toward the right answer. Others are "frozen": so locked in that no instruction changes anything. The friction reading tells you which camp a model is in, so you know in advance whether a clever prompt will earn its keep or do nothing. It is the same instrument as the early warning, just pointed at a different question.

Why this matters

For anyone training models, this is an early-stopping signal for a failure that the usual tools miss entirely. The standard signals stay green while general capability quietly erodes; the friction signature lights up before the damage shows in the score. Read the substrate, and you get a warning while you can still act on it.

The cite

Pødenphant Lund, T. (2026). Reading the substrate, not the score: a self-calibrating friction early-warning. Zenodo. https://doi.org/10.5281/zenodo.20562090
