You don't learn facts, you learn the track they cut

Paper 4B · Pødenphant Lund (2026q)

I study language models to understand people. Give a large language model a maths problem it gets right 75% of the time, then show it one example of how to solve that kind of problem. Its accuracy drops to 52%. That is the classic expertise reversal effect from educational psychology, and it shows up in a neural network that was never built to reproduce it. You don't learn the facts you were handed. You learn the work your brain had to do with them. If the material slides past with no resistance, it leaves no trace. If it forces you to solve something, the solution is what gets remembered.

The puzzle

I was asked to build teaching material about malnutrition in elderly care. The brief was to train care staff in the nine clinical signs of malnutrition, and at the same time in all the illnesses that can lead to it. In short: give the user a great deal of information.

I worked on it. It was hard. However we turned the material over, there was a great deal of content, and the content didn't tell anyone what to do. The audience (the care staff) already gets a great deal of information from many directions. Adding nine new points to remember wouldn't help.

The breakthrough came when we stopped asking "what do we want them to know?" and started asking "what do we actually want them to do?" The clinical guidelines already specified that care staff should offer to weigh residents every month. It just wasn't carried out consistently, because the responsibility was diffuse and the action wasn't built into the routine.

The redesigned material was one sentence: "Remember to weigh, that's good care. If you notice a weight loss of more than one kilo, you need to act." A campaign was built around that one sentence. The detailed information about the nine signs and the illnesses behind them stayed on the back page, there when it was needed. The training shrank from 15-20 minutes of information to about 3 minutes of action.

The original brief had been optimised for the wrong thing. It packaged up the sender's completeness without considering what the recipient could do with it. The redesign was optimised for what the recipient could actually act on. Completeness is a property of the sender. Learnability is a property of the recipient. They are not the same axis.

The unexpected evidence: language models do it too

Take a large language model. Give it a simple chemistry composition task: "Substance A has a rate of 3.7% per hour. Substance B has a rate that is 3.12 times A's. What is B's rate?" With no examples, the model answers correctly about 75% of the time. Add one in-context example showing how this kind of problem is solved. Accuracy drops to 52%. Add three examples. It recovers partway, to 61%.
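The setup above can be sketched in a few lines. This is a minimal illustration, not the paper's harness: the function names, the "Q:/A:" layout, and the answer format in the demonstrations are all assumptions made for the example. It shows how a 0-shot, 1-shot, or 3-shot prompt differs only in the number of worked demonstrations placed before the target question, and how the expected answer for the quoted problem is computed (3.7 × 3.12 = 11.544).

```python
from typing import List, Tuple


def make_problem(rate_a: float, multiplier: float) -> Tuple[str, float]:
    """Build one composition problem and its expected answer."""
    question = (
        f"Substance A has a rate of {rate_a}% per hour. "
        f"Substance B has a rate that is {multiplier} times A's. "
        "What is B's rate?"
    )
    return question, rate_a * multiplier


def build_prompt(target_question: str, demos: List[Tuple[str, float]]) -> str:
    """Place k solved demonstrations before the target question.

    An empty `demos` list gives the 0-shot prompt. The "Q:/A:" layout is
    a hypothetical format chosen for this sketch.
    """
    parts = [f"Q: {q}\nA: {a:.3f}% per hour" for q, a in demos]
    parts.append(f"Q: {target_question}\nA:")
    return "\n\n".join(parts)


def accuracy(correct_flags: List[bool]) -> float:
    """Fraction of trials answered correctly."""
    return sum(correct_flags) / len(correct_flags)


target_q, target_answer = make_problem(3.7, 3.12)
zero_shot = build_prompt(target_q, [])
one_shot = build_prompt(target_q, [make_problem(2.5, 4.0)])
```

Running the same target question under `zero_shot`, `one_shot`, and a three-demonstration prompt, and comparing `accuracy` across conditions, is the comparison the percentages above describe.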

This is the classic expertise reversal effect from educational psychology (the instructional support that helps novices often hurts experts), now showing up in a neural network. The smaller models (the 7B class, the "novice tier") don't show this dip. The smallest model (1.5B) can't even solve the task; its substrate is too limited. Only the competent substrate gets confused by being shown how.

Why? When the model already has a working strategy for the problem, a demonstration that doesn't match that strategy doesn't add new information. It opens a competing strategy. The substrate now has to keep both strategies alive and decide between them. That settling costs computational bandwidth. The model can see two ways forward, and the competition between them shows up as a measurable signal in its output: more candidate tokens stay in contention at each step. Friction theory calls this signal competing routes, and it can be read straight off any language model's output.
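One way to make "more candidate tokens stay in contention" concrete is sketched below. The exact formula the paper uses for the competing-routes signal is not given in this summary, so this is an assumed proxy: the exponentiated entropy of the renormalised top-k token probabilities at each position, which equals 1.0 when a single candidate dominates and rises as alternatives gain probability. The function names are hypothetical, and the input is assumed to be the per-position top-k logprobs that many inference APIs can return.

```python
import math
from typing import List


def effective_candidates(top_logprobs: List[float]) -> float:
    """Effective number of candidate tokens at one position.

    Computed as exp(entropy) over the renormalised top-k probabilities.
    This is an assumed proxy, not the paper's confirmed definition.
    """
    peak = max(top_logprobs)
    # Subtract the maximum before exponentiating for numerical stability.
    weights = [math.exp(lp - peak) for lp in top_logprobs]
    total = sum(weights)
    probs = [w / total for w in weights]
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return math.exp(entropy)


def mean_signal(per_token_top_logprobs: List[List[float]]) -> float:
    """Average the per-position signal over one generated answer."""
    values = [effective_candidates(pos) for pos in per_token_top_logprobs]
    return sum(values) / len(values)
```

Under this proxy, a position where the model is certain scores 1.0 and a position split evenly between two tokens scores 2.0, so an answer-level mean slightly above 1 would indicate that a few positions carried real competition.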

What the paper found

The paper reports eight experiments on six language models (Qwen2.5 at 1.5B, 7B, and 32B, Llama-3.3-70B, Qwen3-235B, and DeepSeek-V3), all on the same chemistry composition task. The main findings:

1. The U-curve depends on substrate capacity

Same task, different model sizes, qualitatively different patterns. The smallest model (1.5B) goes flat on the substrate floor: it can't run a strategy race, because it doesn't have more than one strategy. The middle model (7B) shows a steady gain from demonstrations: the novice tier behaves the way classic instructional psychology predicts. The large model (70B) shows the expertise-reversal dip. That is the same shape educational psychology finds in human experts: support that helps novices hurts experts.

2. The friction signal peaks at the strategy switch

On the 70B model the per-token competing-routes signal is highest at 1-shot (1.114), compared with 1.052 at 0-shot and 1.073 at 3-shot. The friction is observable in the logprobs, not just in the accuracy loss. The model is visibly struggling, position by position, with several strategies alive at the point where one demonstration creates ambiguity.

3. Clearer demonstrations help: the obvious prediction was wrong

The naive prediction: fuller demonstrations add load and ought to make things worse. The opposite happened. When the demonstration showed the working that led to the answer (not just the answer alone), the competing-routes friction fell (1.077 against 1.138) and accuracy rose 16 percentage points. The clearer demonstration closed the strategy race sooner. Clarity reduces friction; ambiguity keeps it going.

4. Format mismatch is reactance

When the system message asked for <result> output and the demonstrations used <answer>, accuracy collapsed from 70% to 48%. The model followed the system instruction in every condition, but it paid a friction cost by actively rejecting the demonstration's format all the way through its answer. This is structurally like human reactance: ask a child not to think about a pink elephant, and you have added the very route you were trying to prevent.
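Scoring a condition like this one requires knowing which tag the model actually used in its answer. The sketch below is one assumed way to check that, with a hypothetical helper name; the paper's own scoring code is not described in this summary. It extracts the first well-formed `<result>` or `<answer>` element from an output string and reports both the tag and its content.

```python
import re
from typing import Optional, Tuple

# Matches <result>...</result> or <answer>...</answer>; the backreference
# ensures the closing tag is the same as the opening one.
TAG_PATTERN = re.compile(r"<(result|answer)>\s*(.*?)\s*</\1>", re.DOTALL)


def extract_tagged(output: str) -> Optional[Tuple[str, str]]:
    """Return (tag_name, content) for the first tagged value, else None."""
    match = TAG_PATTERN.search(output)
    if match is None:
        return None
    return match.group(1), match.group(2)
```

With a helper like this, an answer can be scored separately on whether it followed the system instruction (tag equals `result`) and on whether the extracted value is correct.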

5. Noise is cheap, meaning is expensive

Random gibberish filler in the prompt costs only 13 percentage points of accuracy across a 1600× increase in volume. Semantically plausible elaboration costs 20 percentage points at 60× less volume. The cost isn't in the number of tokens. The cost is in whether the substrate has to spend processing on settling the material. Noise habituates as non-signal. Plausible-but-ambiguous content can't be filed away, and it keeps costing.

6. Substrates that can't remember can't learn across sessions

Language models with no persistent-memory architecture show no encoding gain from within-session repetition. They pay friction inside a single generation, but the trace doesn't carry across calls. This is structural, not a bug: humans have a persistence layer that lets the friction trace build up (memory, sensitisation, eventually skill); stateless LLM inference doesn't. The race mechanism is shared across substrates; the persistence layer is what tells them apart.

7. Capacity has a hard ceiling, and crossing it collapses the curve

The 32B base model has a clean sweet spot for competing-routes activity at 2 facts. Push the task pressure past the substrate's headroom and the sweet spot disappears: the friction sits flat on an overload floor across every condition, and accuracy collapses to zero. The substrate's settling bandwidth is a hard ceiling, not a soft preference.

8. Different triggers may leave different signatures

An exploratory pattern that awaits preregistered confirmation: different cognitive triggers seem to leave different position signatures in the answer. Reactance and strategy ambiguity tend to peak at position 5 (the point where the answer's value is settled on). Over-explanation and closing uncertainty tend to peak at positions 0 and 9 (structural decision points). If it holds up, it would mean you could classify the type of cognitive friction from the first ~10 tokens of any LLM answer. Not a deployable tool yet. A hypothesis worth testing.
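The hypothesis above amounts to a simple lookup once a per-position friction signal is available. The sketch below is illustrative only and inherits the exploratory status of the finding: the function names are hypothetical, positions are taken as 0-indexed over the first 10 tokens as in the text, and the mapping simply restates the tendencies described above rather than a validated classifier.

```python
from typing import List


def peak_position(signal: List[float], window: int = 10) -> int:
    """Index of the highest friction value within the first `window` tokens.

    Ties resolve to the earliest position.
    """
    head = signal[:window]
    if not head:
        raise ValueError("signal must contain at least one position")
    return max(range(len(head)), key=head.__getitem__)


def hypothesised_trigger(position: int) -> str:
    """Map a peak position to the trigger family suggested in the text.

    Exploratory mapping awaiting preregistered confirmation.
    """
    if position == 5:
        return "reactance or strategy ambiguity"
    if position in (0, 9):
        return "over-explanation or closing uncertainty"
    return "unclassified"


example_signal = [1.0, 1.0, 1.1, 1.0, 1.0, 1.4, 1.0, 1.0, 1.0, 1.2]
example_label = hypothesised_trigger(peak_position(example_signal))
```

The `example_signal` values are made up for illustration and are not data from the paper.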

Why it matters

For teaching. Designing materials around what is most complete is a recipe for materials that don't teach. The teacher's job is to design conditions under which the learner's substrate runs the right race, not to deliver the most complete message. Friction theory makes this concrete: the friction is what gets encoded, so design for the friction profile you want. This is close to two long-standing ideas in memory research, encoding specificity (Tulving) and transfer-appropriate processing (Morris, Bransford, and Franks): what you store is the processing you did, so it serves you best when the later situation calls for the same processing.

For prompt engineering. On competent models, fewer-but-clearer demonstrations beat more-but-ambiguous ones. The 1-shot dip on 70B-class models is structural, not a quirk of one particular model. When you design prompts, ask whether the example closes a strategy race or opens one.

For language models as cognitive models. Critics often say LLMs can't be cognitive models because they're "just probabilistic." This paper points the other way. The same expertise-reversal pattern, the same friction signature, and the same closes-versus-opens dynamic show up in language models that were never designed to reproduce these effects. What race resolution under a capacity limit produces in transformers is the same pattern human teachers have wrestled with for decades.

For the cognitive sciences. Cognitive Load Theory has been the dominant framework for fifty years and has been productive, but it specifies no mechanism beneath the load construct. The recursive race account does. Where CLT predicts symmetric novice-vs-expert effects, the most recent meta-analysis (Tetzlaff et al. 2025) finds asymmetric magnitudes, consistent with a directional mechanism rather than a symmetric load curve.

What I don't know

This is language models, not people. The same pattern shows up in both, and that is exactly why it's interesting, but it's a parallel, not proof that human memory works by precisely the same mechanism. Showing that would take measurements in biological systems, and I haven't made those here.

The eighth finding, that different kinds of friction seem to peak at different positions in the answer, is exploratory. I have seen the pattern, but it awaits a preregistered replication before I'll stand behind it. It's a hypothesis worth testing, not a tool you can pick up tomorrow.

And the result about stateless models not learning across sessions doesn't say language models can't remember. It says the version I tested lacks the persistence layer that lets the trace build up. That is a property of the architecture, not a limit on the mechanism itself.

The cite

The full paper is open-access on Zenodo. Concept DOI:

Pødenphant Lund, T. (2026q). Substrates Encode Experience, Not Information: An Encoding-through-Loading Framework with Cross-Substrate Tests in Language Models. Zenodo. https://doi.org/10.5281/zenodo.20059861


Related on this site:

Paper 2B (ICL/FT memory) — the same mechanism at training time: why fine-tuned models hallucinate confidently.
Paper 2 (Capacity scaling) — the cloze-vs-application asymmetry across model sizes, which 4B's substrate-graded patterns extend.
Paper 1 (FT) — the foundational paper that defines friction as competing routes.
The Memory page — the broader treatment of why information-dumping does not teach.
The Learning page — framework-level treatment of encoding-through-loading.