Reading the Substrate, Not the Score: A Self-Calibrating Friction Early-Warning

Paper 2E · Pødenphant Lund (2026)

The accuracy score is a lagging readout. The substrate moves first. Fine-tuning a model on a narrow task can quietly destroy general capability while the held-out score still looks fine: the model keeps answering the easy validation items correctly while its reasoning collapses underneath. This paper reads a friction signature directly from the model's own logprobs (the route-competition in its output distribution) and shows it carries information the score does not. On the training axis it is a self-calibrating early-warning for that collapse; read across model capacity instead of training time, the same signature maps when an instruction frame will help.

DOI (concept): 10.5281/zenodo.20562090
Status: v1 live on Zenodo (2026-06-17)
Author: Tomas Pødenphant Lund [ORCID]

TL;DR

Model evaluation reads the output score: accuracy, loss, a benchmark number. That score is a lagging, easily masked readout of the model's internal state. A model at a performance ceiling, or one whose validation set is too easy, can have its substrate change substantially while the score stays flat. This matters most during fine-tuning, where you want to stop after a useful behaviour has installed but before continued training over-fits the narrow target and erodes general capability.

We study a complementary signal we call friction: a family of quantities read from the model's per-token output distribution (competing routes and per-token entropy from the logprobs). Friction is a property of the substrate, the model's internal route-competition, rather than of the chosen token, and it carries information the score does not.

The headline is on the training axis. The over-fitting collapse of a fine-tuned model is a critical transition, and it is preceded by the standard early-warning signature of such transitions — critical slowing down — several optimizer steps before the held-out accuracy moves. We package this as a self-calibrating, deployable monitor whose trigger is set relative to each run's own opening baseline, with no global threshold and no separate calibration pass, and validate it across model families. The second contribution is the inference axis: the same signature, read across model capacity, sorts substrates into a pressed-versus-frozen regime-map that predicts when an instruction frame will help.

The friction signature is a family of estimators

Friction is read from the logprobs. The coarsest form is competing routes (CR): the number of tokens whose probability clears a threshold at a given position, an integer count of how many routes the model holds open. The continuous form is the per-token Shannon entropy of the output distribution. CR is a cheap discretised proxy; entropy is the continuous ground truth, and we use entropy, or its variance and dynamics, wherever the discrimination is fine.
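
As a minimal sketch of the two estimators described above, the snippet below computes competing routes and Shannon entropy for a single token position. The function names, the 0.1 probability threshold, and the renormalisation over a truncated top-k list are illustrative assumptions; the paper's exact threshold and handling of truncated distributions are not stated here.

```python
import math

def competing_routes(logprobs, threshold=0.1):
    """Count tokens whose probability clears `threshold` at one position.

    `threshold=0.1` is an assumed value for illustration only.
    """
    return sum(1 for lp in logprobs if math.exp(lp) >= threshold)

def token_entropy(logprobs):
    """Shannon entropy (in nats) of one position's output distribution.

    The probabilities are renormalised first, on the assumption that only
    the top-k logprobs are available and may not sum to exactly one.
    """
    probs = [math.exp(lp) for lp in logprobs]
    total = sum(probs)
    return -sum((p / total) * math.log(p / total) for p in probs if p > 0)

# A contested position (several live routes) versus a committed one.
position = [math.log(0.5), math.log(0.3), math.log(0.15), math.log(0.05)]
peaked = [math.log(0.97), math.log(0.02), math.log(0.01)]

print(competing_routes(position), round(token_entropy(position), 4))
print(competing_routes(peaked), round(token_entropy(peaked), 4))
```

The integer count changes only when a probability crosses the threshold, whereas the entropy moves with every shift of probability mass, which is the sense in which CR is the coarser proxy.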

A central methodological point, stated up front because it shapes every result, is that the load-bearing estimator is axis-dependent. No single scalar carries every comparison. Across model capacity (the inference axis) the discriminating quantity is the distributional spread of the whole response: mean per-token entropy and a "wobble" measure, the mean absolute step-to-step change in entropy normalised by its mean. Within training (the dose axis) the discriminating quantity is the parse-token commitment: the first-token entropy and, more robustly across families, the per-token entropy variance.
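
The axis-dependent estimators above could be computed from a response's per-token entropy sequence roughly as follows. This is a sketch under stated assumptions: the function names are hypothetical, and the use of population variance (rather than sample variance) is a choice made here, not something the paper specifies.

```python
from statistics import fmean, pvariance

def response_spread(entropies):
    """Inference-axis readout: mean entropy and 'wobble'.

    Wobble is the mean absolute step-to-step change in entropy,
    normalised by the mean entropy of the response.
    """
    mean_h = fmean(entropies)
    steps = [abs(b - a) for a, b in zip(entropies, entropies[1:])]
    wobble = fmean(steps) / mean_h if steps and mean_h > 0 else 0.0
    return {"mean_entropy": mean_h, "wobble": wobble}

def parse_commitment(entropies):
    """Training-axis readout: first-token entropy and entropy variance.

    Population variance is an assumption made for this sketch.
    """
    return {
        "first_token_entropy": entropies[0],
        "entropy_variance": pvariance(entropies),
    }

entropies = [1.0, 0.5, 1.5, 1.0]  # made-up per-token entropies, in nats
spread = response_spread(entropies)
commit = parse_commitment(entropies)
print(spread, commit)
```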

Two consequences follow. Friction is onset-front-loaded: for detecting which instruction frame is acting, the first-token readout is sharper than the response mean (that onset readout is developed in the companion governance paper, Paper 20). And the integer CR is too coarse for the fine-calibration regimes below, where continuous entropy resolves distinctions CR rounds away.

The headline: over-fitting is a critical transition with an early-warning

We fine-tune a small instruct model (Qwen2.5-1.5B) on a deliberately narrow, answer-only arithmetic task that suppresses chain-of-thought, and instrument the run. At frequent checkpoints we capture the per-token entropy trajectory on a held-out multi-step reasoning probe, together with the held-out accuracy.

The collapse is a critical transition. As training proceeds, the model's multi-step accuracy holds at its baseline and then collapses; the model abandons reasoning and emits the trained answer-only template. Reading the friction signature across the approach to that collapse reveals the standard precursor of a critical transition, critical slowing down (Scheffer et al., 2009). In the steps before the accuracy drop, the per-token entropy variance roughly doubles, its lag-1 autocorrelation rises, and the first-token entropy rises as the parse token becomes contested. All of this happens while the held-out accuracy is still flat. The system then tips: the first token snaps to the template, its entropy goes to zero, and reasoning and accuracy collapse together.
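
The two critical-slowing-down precursors named above, variance and lag-1 autocorrelation, can be sketched with standard estimators. Applying them to the per-token entropy sequence of one probe response is an assumption of this example; the paper's exact windowing and estimator are not given here, and the function names are hypothetical.

```python
from statistics import pvariance

def lag1_autocorrelation(series):
    """Standard lag-1 autocorrelation estimate of a sequence."""
    n = len(series)
    if n < 2:
        return 0.0
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)
    if denom == 0:
        return 0.0  # a constant series has no defined autocorrelation
    num = sum((series[i] - mean) * (series[i + 1] - mean) for i in range(n - 1))
    return num / denom

def slowing_down_precursors(entropies):
    """Return (variance, lag-1 autocorrelation) for one entropy trajectory."""
    return pvariance(entropies), lag1_autocorrelation(entropies)

# A slowly drifting series is persistent; a rapidly alternating one is not.
persistent = [1.0, 2.0, 3.0, 4.0, 5.0]
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
print(slowing_down_precursors(persistent))
print(slowing_down_precursors(alternating))
```

A rising lag-1 autocorrelation means each fluctuation persists into the next step, which is the "slower recovery" that the early-warning literature associates with an approaching tipping point.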

The signature leads the score. Sampling the approach densely, the friction departure precedes the accuracy drop by several optimizer steps. The mechanism is visible in the responses: the narrow template first contests the first token (rising entropy, accuracy still intact because the model is still reasoning), then captures it (entropy to zero), at which point reasoning dies. The contest is the early-warning, and it precedes the capture. This is the low-frequency signature of an approaching tipping point — larger and more persistent fluctuations — not a faster oscillation, which is why variance and autocorrelation, not a frequency count, are the load-bearing precursors here.

The precursors replicate across families. Re-analysing the held-out entropy trajectories of all three over-trained families shows the per-token entropy variance rising before the accuracy collapse on each (2.1×, 3.1×, and 1.7× over each run's own opening baseline), with the lag-1 autocorrelation rising alongside it (1.7×, 2.7×, 2.3×). The first-token-entropy precursor is the model-dependent one: strong on Qwen2.5-1.5B and Phi-3.5 (1.6× and 4.4×) and weak on SmolLM2 (1.2×), the same family on which the monitor gives detection rather than early-warning. Seed replication confirms the rise is robust across families (Qwen on five seeds, Phi-3.5 and SmolLM2 on three each), with the contest-then-capture pattern identical on every seed.

A self-calibrating, deployable monitor

The early-warning is only useful if it can be triggered prospectively, on a new model, without knowing in advance where the collapse will be or hand-tuning a threshold. The trigger fires on the per-token entropy variance rising above a multiple (1.5×) of the run's own opening baseline. Because the threshold is relative to each run's own baseline, per-model calibration is automatic and free: no separate calibration pass and no global nat-value to port across models. A fixed absolute threshold does not transfer, because baseline friction and collapse speed vary by family; a relative one does.
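
A relative trigger of this kind might look like the sketch below. The 1.5× factor comes from the text; the function name, the choice of averaging the first three checkpoints as the "opening baseline", and the example numbers are assumptions made for illustration.

```python
from statistics import fmean

def first_flag_step(variances, factor=1.5, baseline_steps=3):
    """Return the first checkpoint index whose entropy variance exceeds
    `factor` times the run's own opening baseline, or None if none does.

    `baseline_steps` (how many opening checkpoints define the baseline)
    is an assumed parameter, not a value taken from the paper.
    """
    baseline = fmean(variances[:baseline_steps])
    for step in range(baseline_steps, len(variances)):
        if variances[step] > factor * baseline:
            return step
    return None

# Made-up per-checkpoint entropy variances for one run.
run = [0.10, 0.11, 0.09, 0.12, 0.14, 0.16, 0.22]
flat = [0.10] * 6
print(first_flag_step(run), first_flag_step(flat))
```

Because the threshold is a ratio to the run's own baseline, the same call works unchanged on a model whose absolute entropy variance is ten times larger or smaller.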

The 1.5× factor is a false-alarm frontier, not a tuned constant. Sweeping the factor from 1.25× to 3× on all three collapsing runs shows that lowering it does not buy earlier warning for free: SmolLM2 first learns to ceiling before it over-fits, and a 1.25× factor trips inside that genuine learning phase, twelve steps before any degradation. At 1.5× the trigger stays silent through the healthy rise and fires at the real degradation onset. So 1.5× is the lowest factor that does not false-alarm on a healthy-learning phase. A post-hoc characterisation of the healthy phase puts the benign-drift ceiling at about 1.46× on the one family with a long healthy window, so 1.5× sits just above it.
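
The factor sweep can be illustrated on a synthetic trajectory with a benign rise followed by a real departure. The data below are invented to show the shape of the trade-off and are not the paper's measurements; the function names and the three-checkpoint baseline are likewise assumptions.

```python
def first_crossing(variances, factor, baseline_steps=3):
    """First checkpoint index where variance exceeds factor x baseline."""
    baseline = sum(variances[:baseline_steps]) / baseline_steps
    for step in range(baseline_steps, len(variances)):
        if variances[step] > factor * baseline:
            return step
    return None

def sweep(variances, factors=(1.25, 1.5, 2.0, 3.0)):
    """Map each candidate factor to the step at which it would fire."""
    return {f: first_crossing(variances, f) for f in factors}

# Synthetic run: baseline 1.0, a healthy bump peaking at 1.4x (steps 3-6),
# then a genuine departure from step 7 onward.
synthetic = [1.0, 1.0, 1.0, 1.2, 1.4, 1.3, 1.1, 1.6, 2.4, 3.5]
result = sweep(synthetic)
print(result)
```

In this toy run the 1.25× setting fires inside the healthy bump, while 1.5× waits for the genuine departure, and higher factors only delay the flag further.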

The trigger raises a flag, it does not hard-stop. The flag densifies monitoring and signals an imminent collapse rather than killing the run, so a sensitive setting is safe: a false flag costs a little extra evaluation rather than a wrongly terminated run. Run as a single sweep over four open models with one over-trained run each, the trigger flags the collapse in three of three families that collapse (Qwen2.5-1.5B, Phi-3.5-mini, SmolLM2-1.7B), and a difficulty-masking gate correctly skips a fourth (Qwen2.5-3B) whose probe baseline is below the informativeness floor. The honest residual is that the lead is model-dependent: on the slow, gradual collapse the flag fires at the crash rather than before it, and the deployed variance trigger is best read as a cross-family-robust detector while the longer early-warning lead lives in the first-token-entropy channel where the first token is contested.
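
The flag-not-stop behaviour and the difficulty-masking gate could be wired together along these lines. Everything numeric here is hypothetical: the informativeness floor of 0.2, the default and dense evaluation intervals, and the class and function names are placeholders, since the paper's actual values are not given in this summary.

```python
from dataclasses import dataclass

@dataclass
class MonitorState:
    eval_every: int = 50   # hypothetical evaluation interval, in optimizer steps
    flagged: bool = False

def probe_is_informative(baseline_accuracy, floor=0.2):
    """Difficulty-masking gate: skip probes whose baseline accuracy is
    below an informativeness floor (0.2 is an assumed value)."""
    return baseline_accuracy >= floor

def update(state, variance_ratio, factor=1.5, dense_every=5):
    """Raise a flag and densify evaluation; never terminate the run."""
    if not state.flagged and variance_ratio > factor:
        state.flagged = True
        state.eval_every = min(state.eval_every, dense_every)
    return state

state = MonitorState()
if probe_is_informative(baseline_accuracy=0.6):
    update(state, variance_ratio=1.2)   # below the factor: nothing changes
    update(state, variance_ratio=1.7)   # crosses the factor: flag, evaluate densely
print(state)
```

The cost of a false flag in this design is only the extra evaluations at the denser interval, which is why a sensitive factor is tolerable.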

Why does the validation curve miss this? The standard over-fitting detector watches the train-validation gap on the fine-tuning task. But this collapse is not over-fitting to the training task; it is erosion of an out-of-task capability. The model keeps mastering the trained task while it forgets how to reason, so the in-task metric is blind by construction. Logged explicitly across the seeded runs, the in-task validation loss falls straight through the reasoning collapse on all three families. It does not merely fail to warn: it moves the opposite way as the model masters the trained task. The friction signature is the signal that tracks the substrate change.

Second deliverable: the inference regime-map ("when a frame helps")

The same signature, read across model capacity at fixed task difficulty rather than across training time, is a regime-map. On a nine-substrate, four-family panel, mean entropy and wobble sort substrates into a robust two-phase ordering: pressed (friction high, frames bite) versus frozen (friction low, near-deterministic, frames do nothing). The ordering holds within a single platform — for example the smaller versus larger OpenAI frontier model — which controls for the obvious platform and quantisation confound. A frame's effect, and the readability of the instruction type from the logprobs, is high on pressed substrates and fades at the frozen ceiling regime.

The practical reading is a predictor of the capacity-by-difficulty inverse-U: a frame helps where the substrate is pressed but not frozen. This is the same instrument as the headline, read on a different axis — training-dose versus model-capacity. That unity, one friction signature read on two axes, is why this is one paper rather than two.

Two complements to the score

The instrument complements accuracy in two distinct ways, and reading them together is what makes the monitor trustworthy. First, it moves where the score is masked: on a ceiling-level probe accuracy is pinned while the signature collapses, and on a difficulty-floored probe accuracy is pinned near chance while the signature still shifts. The signature sees substrate change the score cannot.

Second, commitment is good or bad depending on the held-out accuracy. A falling entropy ("the model is committing") is healthy encoding when held-out accuracy is rising, and over-fitting when held-out accuracy is falling. This makes precise what "accuracy-complementary" means: the signature flags that the substrate is changing where the score is blind, but interpreting the valence of that change requires conditioning on an unmasked accuracy trend. The signature is a complementary, not a standalone, instrument, which is why the trigger is paired with an accuracy probe and a gate rather than read in isolation.
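
The conditioning rule above can be written as a small decision function. This is only a sketch of the stated logic: the function name, the labels, and the `tol` dead-band for treating accuracy as flat are assumptions introduced here.

```python
def read_commitment(entropy_delta, accuracy_delta, tol=0.0):
    """Interpret a change in entropy by conditioning on held-out accuracy.

    entropy_delta:  change in the entropy readout between checkpoints
    accuracy_delta: change in an unmasked held-out accuracy over the same span
    tol:            assumed dead-band within which accuracy counts as flat
    """
    if entropy_delta >= 0:
        return "not committing"
    if accuracy_delta > tol:
        return "healthy encoding"
    if accuracy_delta < -tol:
        return "over-fitting"
    return "ambiguous: accuracy flat or masked"

print(read_commitment(-0.3, +0.05))
print(read_commitment(-0.3, -0.10))
print(read_commitment(-0.3, 0.0))
```

The third case is the one the gate exists for: when the accuracy trend is pinned, the entropy drop alone cannot say whether the commitment is good or bad.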

Supporting results

Connections to other papers in the series

Read the paper

The full paper is on Zenodo (concept DOI 10.5281/zenodo.20562090):

Pødenphant Lund, T. (2026). Reading the substrate, not the score: a self-calibrating friction early-warning. Zenodo. https://doi.org/10.5281/zenodo.20562090
