Findings & new explanations — in plain English

New findings in language models, and what they explain about us

Language models have a small tell that gives away when they are about to answer wrong. It is like the gut feeling you get just before you guess. The best part is that the signal is free: the model already gives it away. Used well, it can lift an open model past GPT-4o and GPT-4.1 on SimpleQA, for under ten kroner of setup. It is the most concrete finding here, but far from the only one.

There are three kinds of finding here: new measurements (things seen for the first time), new explanations for known patterns (a mechanism instead of a label), and the methods that make the measurements possible.

Empirical discoveries

Three universal dimensions of friction

When you do statistics on friction signals from 15 different language model architectures, you get exactly three independent axes: magnitude, distribution, rhythm. The first axis is essentially identical across all models. This means the three dimensions are not a property of any specific model, but of race architecture itself.

Surprise drives attention in language models

I have measured a correlation between how "surprised" a language model is at a particular word and how much attention later words pay to it. This is the artificial-substrate version of a pattern found in the hippocampus: surprise-driven replay. It is the first time the pattern has been measured directly in an artificial system.
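A minimal sketch of that kind of measurement, under stated assumptions: surprise is taken as surprisal (the negative log-probability of the word), and the link is scored with a plain Pearson correlation. The text does not say which statistic was used or how "attention from later words" was aggregated, so the function names and the choice of Pearson are illustrative only.

```python
import math
from typing import Sequence


def surprisal_bits(logprob: float) -> float:
    """Surprisal in bits for a token, given its natural-log probability."""
    return -logprob / math.log(2)


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

In use, one list would hold the surprisal of each word and the other the attention that word later receives (an assumed per-word number, however it is aggregated); the correlation between the two lists is the quantity of interest.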

Language models often "know" the answer but choose wrong

When language models answer incorrectly, the right answer is often on their short list of candidates, just not at the top. This is not lack of knowledge; it is a choice error. I call it the friction ceiling: a limit on what the signal can achieve, because it measures cost, not correctness.
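One way to make the distinction between "does not know" and "chose wrong" concrete is to compare top-1 accuracy with how often the right answer sits somewhere in the top few candidates. The sketch below assumes each question comes with a ranked candidate list and a single gold answer; the function name and the cut-off k are hypothetical, not taken from the measurements described here.

```python
from typing import Sequence, Tuple


def split_errors(
    ranked: Sequence[Sequence[str]],
    gold: Sequence[str],
    k: int = 5,
) -> Tuple[float, float]:
    """Return (top-1 accuracy, share of errors with the gold answer in the top k)."""
    if len(ranked) != len(gold) or not gold:
        raise ValueError("need one ranked candidate list per gold answer")
    top1_hits = 0
    errors = 0
    near_misses = 0
    for candidates, answer in zip(ranked, gold):
        if candidates and candidates[0] == answer:
            top1_hits += 1
            continue
        errors += 1
        # The model ranked the right answer, just not first: a choice error.
        if answer in candidates[:k]:
            near_misses += 1
    top1_accuracy = top1_hits / len(gold)
    choice_error_share = near_misses / errors if errors else 0.0
    return top1_accuracy, choice_error_share
```

A high second number means most wrong answers are choice errors rather than missing knowledge.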

A free signal that improves any LLM by 12-21 percentage points (combined pipeline)

The most practical discovery. I use the language model's own uncertainty signal (free from the API) to choose when the model should reconsider, and when it should just abstain. The strategy alone gives +7.7 to +20.8 pp on four of five cells; combined with calibrated abstention, the four cells where both were measured reach +12 to +21 pp. On SimpleQA, the combined pipeline lifts Qwen3-235B past GPT-4o and GPT-4.1. Calibration costs about ten kroner per setup. Works on any model with a standard API.
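The decision described above can be pictured as a three-way router. This is only a sketch: the two thresholds, their names, and the idea that a single number drives both decisions are assumptions made for illustration, not the calibrated pipeline itself.

```python
def route(signal: float, reconsider_at: float, abstain_at: float) -> str:
    """Pick an action from an uncertainty signal (higher means less certain).

    Both thresholds are hypothetical and would come from calibration.
    """
    if reconsider_at > abstain_at:
        raise ValueError("reconsider_at must not exceed abstain_at")
    if signal >= abstain_at:
        return "abstain"      # too uncertain: say "I don't know"
    if signal >= reconsider_at:
        return "reconsider"   # uncertain enough to take a second look
    return "answer"           # confident: keep the first answer
```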

Assistant-training measurably "washes out" friction signals

When a language model is trained to be an assistant (the way ChatGPT is), its friction signal can become markedly flatter on hard tasks. On Llama-3.1-405B, about two thirds of the base-model variation disappears, down near the floor where there is almost no friction left to measure. The effect varies with the task: strong on hard reasoning, milder on the easier ones.

Break the format and the model measurably resists

If I tell a language model "answer in format A" but show it examples in format B, it follows format A 100% of the time on the surface. But it pays a high per-word cost "resisting" the format it was shown. On Llama-3.3-70B (150 responses) accuracy drops from 70% to 48%. The model obeys but resists: the resistance is visible in the friction signal even when the model does exactly as told on the surface.

New explanations for known phenomena

Loss aversion isn't psychology, it's math calibrated by lifespan

Kahneman and Tversky showed we fear a loss about twice as much as we enjoy an equal gain. But why? My account: a limited lifespan pushes you to commit early, before you have sampled enough, because committing too late can be costly when time runs out. That early commit is loss aversion. Language models have no lifespan to lose, so they go the opposite way and commit later (43-48% of the way through, on the models I tested). That reversal is the point: it shows loss aversion is not an isolated flaw in humans, but what happens when a limited lifespan pushes you to choose too early.

Hysteresis is the precondition for learning, not an error

Hysteresis, a system carrying traces of its own history, has traditionally been treated as an error or side-effect. My framework flips this: hysteresis is the structural precondition for learning to happen at all. A system that bears no trace of its history cannot learn. This applies to brains, neural networks, and physical systems with memory.

Cognitive biases are thermodynamically necessary

Anchoring, confirmation bias, sunk cost, framing effects, the classical cognitive "errors", are not failures of reasoning. They are necessary consequences of any computational architecture that must choose under constraint. An "Econ" (perfectly rational agent without bias) is thermodynamically forbidden in any physical system. Bias is the price of being able to make decisions at all.

Language models show both surprise and resistance

Language models do get surprised: words they did not see coming draw measurably more attention, and breaking the format makes them measurably resist. Both effects are measured directly. The still-open question is not whether they react, but whether they can tell the two apart: whether there is a layer that decides if a friction event should make the model change its mind (a source it trusts) or dig in (a source it does not). That layer has not yet been seen in a language model.

"Catastrophic forgetting" is not damage, it is signal redistribution

When language models are fine-tuned on new tasks, they often lose their original capacities. This has been called "catastrophic forgetting", and read as the model being "damaged". A reverse test falsifies that reading: if you remove the added layer afterward, 100% of the original performance comes back. The knowledge was intact all along, just outranked. The mechanism is a redistribution of the model's signal budget, not damage. This one explanation pulls together six phenomena that were previously seen as separate.

Race architecture turns up in wildly different systems

Seven seemingly unrelated phenomena (from quantum physics and Ohm's law to chemical kinetics and the Yerkes-Dodson curve in psychology) look like expressions of one and the same necessity: choosing under pressure. The claim is a shared lens, not that the systems are identical; they may share the same race-structure while differing in material. The pattern spans many orders of magnitude in time. Here is how to falsify it: find a system with race architecture that does not show the characteristic inverted U-curve.

Methodological innovations

CR — a free signal from language models

Competing Routes (CR) is the count of high-probability alternatives at each token position. It is free from the API, works on any model, and correlates with errors. CR is the operational handle that makes substrate-universal friction measurable in artificial systems.
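As a rough sketch of the definition, CR can be computed from the top log-probabilities that many APIs return alongside each generated token. Two things here are assumptions: the probability cut-off for "high-probability" (0.1 below is a placeholder) and the choice to count only candidates other than the token actually emitted. The input shape, one mapping of candidate token to log-probability per position, is also illustrative; real APIs name these fields differently.

```python
import math
from typing import List, Mapping, Sequence


def competing_routes(
    top_logprobs: Sequence[Mapping[str, float]],
    chosen: Sequence[str],
    min_prob: float = 0.1,
) -> List[int]:
    """Count, per token position, the alternatives with probability >= min_prob.

    top_logprobs[i] maps candidate tokens at position i to natural-log
    probabilities; chosen[i] is the token the model actually emitted there.
    """
    if len(top_logprobs) != len(chosen):
        raise ValueError("need one candidate mapping per emitted token")
    counts = []
    for candidates, emitted in zip(top_logprobs, chosen):
        count = sum(
            1
            for token, logprob in candidates.items()
            if token != emitted and math.exp(logprob) >= min_prob
        )
        counts.append(count)
    return counts
```

A position where the model is sure of itself gives 0; a position where several continuations are live gives 2 or more.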

Frontloaded ICL instead of fine-tuning — the practical shortcut

For encoding studies, fine-tuning has been the standard method, and it is expensive and slow. Putting all the examples into the prompt instead (and then asking one question) can replace fine-tuning in many cases. It is fast (~5 seconds vs hours), cheap (cents vs dollars), and uniform across model families. Credit note: I came to this approach independently out of frustration with fine-tuning's slow turnaround, then learned others had used variations before me. It is not original, only one of the methods the empirical programme depends on.
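The idea fits in a few lines: put every example first, then ask the one question. The "Q:" / "A:" layout and the function name below are placeholders chosen for illustration, not the format used in the studies.

```python
from typing import Sequence, Tuple


def build_frontloaded_prompt(
    examples: Sequence[Tuple[str, str]],
    question: str,
) -> str:
    """Place all worked examples up front, then end on a single open question."""
    if not examples:
        raise ValueError("frontloading needs at least one example")
    shots = [f"Q: {q}\nA: {a}" for q, a in examples]
    # The final block has no answer, so the model completes it.
    return "\n\n".join(shots + [f"Q: {question}\nA:"])
```

The resulting string is sent as an ordinary prompt, so nothing about the model itself changes between runs.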

Calibrated abstention via friction signal

A language model can learn to say "I don't know" based on its own friction signal. This adds +6.5 to +14.1 percentage points to success rate at 20% abstention, for free. Combined with strategy correction, it produces a lift bigger than the sum of the two on their own.
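A minimal sketch of how such a threshold could be calibrated, assuming a small set of questions with a friction score each (higher means more friction) and a known right or wrong outcome. Picking the threshold as a quantile of the calibration scores, and scoring by accuracy on the questions that are still answered, are both assumptions for illustration; the text does not spell out either.

```python
from typing import Sequence


def calibrate_threshold(scores: Sequence[float], abstain_rate: float = 0.2) -> float:
    """Return the score above which the model abstains.

    Roughly abstain_rate of the calibration questions fall above the
    threshold (ties in the scores can make the share smaller).
    """
    if not 0.0 <= abstain_rate < 1.0:
        raise ValueError("abstain_rate must be in [0, 1)")
    ordered = sorted(scores)
    n_keep = len(ordered) - int(round(abstain_rate * len(ordered)))
    if n_keep <= 0:
        raise ValueError("not enough calibration scores")
    return ordered[n_keep - 1]


def should_abstain(score: float, threshold: float) -> bool:
    """Abstain when the friction score is above the calibrated threshold."""
    return score > threshold


def accuracy_on_answered(
    scores: Sequence[float], correct: Sequence[bool], threshold: float
) -> float:
    """Accuracy over the questions the model still answers."""
    answered = [c for s, c in zip(scores, correct) if not should_abstain(s, threshold)]
    if not answered:
        raise ValueError("the model abstained on every question")
    return sum(answered) / len(answered)
```

If errors cluster at high friction, dropping the highest-friction fifth of questions removes more wrong answers than right ones, which is where the lift comes from.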

One of the most striking findings has its own plain-English walkthrough: Why "knows little, believes a lot" shows the Dunning-Kruger curve measured directly in four language models.

Numbers, references and tables are in the technical version: findings (technical).