Learning — what the framework predicts and finds

A track that gets carved when an answer wins, not a module in the brain

Too easy or too hard, the track stays shallow; in the middle it carves deepest

We tend to think of learning as something special, a storehouse in the brain where things get filed away. But that is not how it works. Learning is just what happens when several possible answers compete to win, and the winner leaves a track. Next time the track is already there, and the answer is easier to find. That is the whole mechanism, and we can watch it directly in a language model, where the tracks are easy to measure. It also explains something strange: a system that keeps no trace of its own past cannot learn at all. The tracks are not a flaw. They are the very precondition.

Hysteresis is the precondition for learning

Hysteresis (a system carrying traces of its own history) has traditionally been treated as an error or side-effect to be minimised. The framework flips this: hysteresis is the structural precondition for learning in any bounded probabilistic substrate. In a substrate that bears no trace of its history, learning does not occur. Path-dependent state is what makes learning structurally possible.

This applies equally to:

Biological brains — synaptic weights change because activity leaves a trace
Artificial neural networks — weights update because the loss-trajectory is path-dependent
Physical systems with memory — magnetisation, glasses, polymer dynamics all show learning-like adaptation


A system that does not retain this asymmetry between A and B cannot learn. The trace is the memory.
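A toy sketch can make the point concrete. It assumes two competing routes, here called A and B, a softmax race between them, and a fixed reinforcement step each time a route wins. The names `win_probability`, `run_trial` and the `trace` parameter are illustrative inventions, not anything from the papers.

```python
import math

def win_probability(weights, route):
    """Softmax probability that `route` wins the race."""
    total = sum(math.exp(w) for w in weights.values())
    return math.exp(weights[route]) / total

def run_trial(weights, winner, trace=0.5):
    """Return new weights after `winner` wins one race.

    `trace` is the (assumed) amount of reinforcement the win leaves behind.
    trace=0.0 models a substrate that keeps no record of its history.
    """
    updated = dict(weights)
    updated[winner] += trace
    return updated

start = {"A": 0.0, "B": 0.0}  # both routes begin equally likely
with_trace = start
no_trace = start
for _ in range(3):  # route A wins three races in a row
    with_trace = run_trial(with_trace, "A", trace=0.5)
    no_trace = run_trial(no_trace, "A", trace=0.0)

p_with = win_probability(with_trace, "A")
p_without = win_probability(no_trace, "A")
print(f"A wins next time, with trace:    {p_with:.3f}")
print(f"A wins next time, without trace: {p_without:.3f}")
```

In this sketch the substrate with a trace has become asymmetric between A and B, and the one without a trace is exactly where it started, however many races it has run.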

Why information-dumping doesn't teach (in models or in humans)

Here is where the framework gets uncomfortable.

Large language models are literal computers, designed from the ground up to absorb information. They have unlimited patience, perfect recall of whatever is shown to them in a session, no biological limit on attention. If any system could be taught by information-dumping, it would be them.

They can't.

Paper 2B shows this directly. Take 47 invented facts. Fine-tune a language model on them, with 100 epochs of training, with paraphrase augmentation, with everything we know how to throw at it. The result is a model that performs worse on cloze retrieval than one that just has those same 47 facts sitting in its prompt. Worse. After all that training. The cumulative gradient pressure compresses the calibrated distribution; the model commits with high confidence to whatever happened to win each route; and the alternatives (which is to say, the actual learning) get pressed below the noise floor.
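What "compressing the distribution" might look like can be sketched in a few lines. This is an illustration under loud assumptions, not the measurement from Paper 2B: gradient pressure is stood in for by simply scaling the logits, and the logits and the noise floor are made-up values.

```python
import math

def softmax(logits):
    """Convert logits to probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_bits(probs):
    """Shannon entropy in bits; lower means a more committed distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

base_logits = [2.0, 1.5, 1.0, 0.5]  # hypothetical scores for four candidate answers
NOISE_FLOOR = 0.01                  # hypothetical threshold for "still visible"

results = []
for pressure in (1, 4, 16):  # assumed stand-in for cumulative training pressure
    probs = softmax([x * pressure for x in base_logits])
    surviving = sum(1 for p in probs if p >= NOISE_FLOOR)
    results.append((pressure, entropy_bits(probs), surviving))
    print(f"pressure {pressure:>2}: entropy {entropy_bits(probs):.3f} bits, "
          f"{surviving} answers above the noise floor")
```

As the pressure rises in this toy, the ranking of the answers never changes, but the entropy falls and the alternatives drop out of sight one by one.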

If language models, which are designed to be teachable, cannot be taught by information-dumping, the question becomes harder. Why have we assumed humans can be?

The sender-receiver mismatch

The standard model of teaching is: the sender packs information; the receiver receives information. If the receiver doesn't learn, the receiver isn't trying hard enough.

That model is physics-blind.

You do not learn information. You learn the trace information leaves. The trace is the physics. The trace is what gets cut into the substrate. The information itself is just the stimulus that produces, or fails to produce, the trace.

If the trace doesn't get cut, no amount of effort or instruction or motivation matters. You cannot store what you have not traced. It is not a question of will but of which routes got reinforced often enough, under the right conditions of competition and load, to leave a channel in the substrate. The water-on-the-tile-floor metaphor again: if you don't drag your finger through it, the channel does not form. Period.

What this implies for teaching

The pedagogical implication is not "throw more information at the student until they learn." It is "design conditions under which the student's substrate cuts the relevant trace." These are different problems. They feel similar from the sender's side, since both involve providing material. They diverge sharply at the substrate level: the first treats the receiver as a passive storage device; the second treats them as a hysteresis-bearing system that has to be made to do something with the material before any trace can form.

This is what Bjork's "desirable difficulties" actually means at the substrate level: difficulty is what raises route-competition enough that the trace gets cut deeply. Cloze tests cut a shallow trace because they require little competition; application tests cut a deep one because they require composition under load. Spacing works because the trace gets re-activated and re-deepened. Interleaving works because it forces routes to compete instead of being pre-sorted.

None of this is motivation. All of it is physics. The teacher who blames "lazy students" is the equivalent of a programmer blaming the laws of thermodynamics for an inefficient algorithm.

Encoding-through-loading

The standard cognitive-science view treats encoding as a separate process from retrieval and decision. The framework collapses this: what gets encoded is what wins route-competition under load. There is no separate encoding module: the same race-resolution machinery that produces decisions also leaves the trace that constitutes learning.

This connects to two classical findings:

Levels of processing (Craik & Lockhart 1972): deeper semantic processing produces stronger encoding. The framework explanation: deeper processing requires resolving more competing routes, leaving a richer hysteresis trace.
Distinctiveness effect / Von Restorff (1933): outliers are remembered better. Tested at the gradient level in fine-tuned LLMs: violations of an implicitly learned pattern produce stronger encoding than conforming instances.

What strengthens the trace: surprise

If one thing decides how deep a trace gets cut, it is surprise. An input that matches what the substrate already expects opens almost no competition and leaves a shallow trace. An input that breaks the expectation forces the substrate to work to resolve the mismatch, and that work cuts deeper.

This is measured directly in language models. Words the model did not see coming draw measurably more attention from the words that follow. And a fine-tuned model encodes a break in a pattern more strongly than an instance that simply follows it. That is the von Restorff effect, seen right down at the gradient level: what stands out sits deeper.

It is also why "desirable difficulties" work. Difficulty is just surprise made systematic: it raises competition, so the trace is cut deep. And it is why variety beats repetition. Train the same 25 facts with four different phrasings instead of one, and reliable recall rises from 38% to 94%, even though the amount of training is the same. Each new phrasing reopens the race instead of re-running one that is already settled.
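One common way to put a number on surprise is surprisal: the negative log-probability the model assigned to what actually arrived. Using surprisal here is an assumption of this sketch; the text above does not say which measure the papers use, and the probabilities below are invented for illustration.

```python
import math

def surprisal_bits(probability):
    """Surprisal in bits of an outcome the model gave `probability` to."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(probability)

# Hypothetical next-word probabilities, chosen as powers of two for clarity.
expected_word = surprisal_bits(0.5)        # the model saw it coming
unexpected_word = surprisal_bits(0.03125)  # the model did not (1 in 32)

print(f"expected word:   {expected_word:.1f} bits")
print(f"unexpected word: {unexpected_word:.1f} bits")
```

On this reading, an input that matches expectation carries little surprisal and opens little competition, while one that breaks expectation carries several times as much.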

Language models "learn" differently depending on task type


Two task types on the same knowledge, tested here on language models of different sizes, where "B parameters" means billions of weights in the neural network (the model's size; e.g., 8B means eight billion, roughly a mid-sized model):

Cloze (recall) — "What is the capital of Denmark?" — saturates early. Most models reach ~90% accuracy by 8B parameters.
Application (chaining facts) — "If the capital lies on Sealand, and you take a train west from there..." — scales monotonically from 2% (0.5B parameters) to 85% (70B — the size of Llama 70B).

Same knowledge, different load. Cloze is indexing-bound; application is composition-bound. The bottleneck migrates with capacity.
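A back-of-envelope sketch shows why composition can lag recall even when every individual fact is well known. It assumes that a chained question succeeds only if every step succeeds and that steps fail independently. That is a simplification made here for illustration, not a model taken from the papers, and the numbers are not the measured ones above.

```python
def chain_accuracy(step_accuracy, steps):
    """Accuracy on a task that needs `steps` correct lookups in a row.

    Assumes independent steps, so the per-step accuracies multiply.
    """
    if not 0.0 <= step_accuracy <= 1.0:
        raise ValueError("step_accuracy must be in [0, 1]")
    if steps < 1:
        raise ValueError("steps must be at least 1")
    return step_accuracy ** steps

recall = chain_accuracy(0.9, 1)       # a single lookup, cloze-style
application = chain_accuracy(0.9, 3)  # three facts chained together

print(f"one step:    {recall:.3f}")
print(f"three steps: {application:.3f}")
```

The knowledge is identical in both calls. Only the load differs, which is the sense in which a learner who passes the cloze test can still fail the application test.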

Implication for educational science: the same knowledge encoded at different capacity levels supports different task types. A learner who can do cloze cannot necessarily do application; the gap is not motivation, it is composition-bounded computation.

Working memory and long-term memory: two ways to influence the race

There are two ways to get a language model to give a new answer, and they work on completely different parts of the race.

Fine-tuning changes the weights. It changes how the race starts: which routes are already favoured before the input even arrives. That is long-term memory. The disposition is baked into the substrate and stays there between conversations.

In-context learning, putting it in the prompt, does not touch the weights. It does not change how the race starts. It can push a route while the race runs, but a route pushed only by context has a harder time winning than one baked into the weights. It helps most when the base pressure, the weight-encoded resistance pulling toward the old answer, is low to begin with. That is working memory: held for the moment, gone when the conversation ends.

One small detail makes the picture sharp. A base model is the race-start as raw pretraining statistics: the unshaped prior, with no fine-tuning on top. A fine-tuned model has had that race-start reshaped. And that reshaping is exactly why fine-tuned models can be confidently wrong. Pressing the weights toward one answer also compresses the very signal that would have shown the model it was uncertain. The base model still carries that signal; the fine-tuned one has flattened it. That is what Paper 2B measures directly.
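The two levers can be sketched as a two-route race between the old answer and the new one. Everything here is a simplifying assumption: the logistic form, the function name `p_new_answer`, and the idea that base pressure, context push and weight shift can be added on one scale. The numbers are made up.

```python
import math

def p_new_answer(base_pressure, context_push=0.0, weight_shift=0.0):
    """Probability that the new answer beats the old one.

    base_pressure: weight-encoded advantage of the old answer (the race-start).
    context_push:  help from the prompt, applied while the race runs.
    weight_shift:  change to the race-start itself, as fine-tuning would make.
    """
    margin = context_push + weight_shift - base_pressure
    return 1.0 / (1.0 + math.exp(-margin))

# The same prompt push against a weak and a strong prior (hypothetical values).
p_low = p_new_answer(base_pressure=1.0, context_push=2.0)
p_high = p_new_answer(base_pressure=5.0, context_push=2.0)

# Reshaping the race-start instead, against the strong prior.
p_tuned = p_new_answer(base_pressure=5.0, weight_shift=6.0)

print(f"prompt only, low base pressure:  {p_low:.3f}")
print(f"prompt only, high base pressure: {p_high:.3f}")
print(f"fine-tuned, high base pressure:  {p_tuned:.3f}")
```

In this toy the prompt is enough when the old answer is only weakly favoured and barely registers when it is strongly favoured, which is the pattern the paragraph on in-context learning describes.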

"Catastrophic forgetting" is signal redistribution, not damage

Catastrophic forgetting in fine-tuned LLMs has been interpreted as substrate damage: the base model "loses" knowledge during adaptation. This interpretation is empirically falsified.

The reverse-test (v13c, Paper 6 forthcoming): remove the LoRA adapter, and base performance returns to 100% of baseline, a 179.5% recovery relative to the adapter-degraded state. The base substrate is 100% intact; the adapter rebalances which routes win competition, but does not damage the underlying weights.

The mechanism is signal-budget redistribution: under fine-tuning, route-competition shifts toward the new task, away from the original. The original capability is preserved. It is just outranked. Removing the adapter restores the original ranking.
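The reason removal can restore the original ranking is visible in how a LoRA adapter is applied: the base weight matrix stays frozen and the adapter contributes a separate low-rank product that is added on top. The sketch below uses tiny made-up matrices, omits the usual scaling factor, and assumes the adapter has not been merged into the base weights. It is not the v13c protocol.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def effective_weights(base, lora_b=None, lora_a=None):
    """Weights the model actually uses: base alone, or base + B @ A."""
    if lora_b is None or lora_a is None:
        return [row[:] for row in base]
    return add(base, matmul(lora_b, lora_a))

base_w = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weights (hypothetical values)
lora_b = [[0.5], [-0.5]]           # rank-1 adapter factor, shape 2x1
lora_a = [[1.0, 2.0]]              # rank-1 adapter factor, shape 1x2

snapshot = [row[:] for row in base_w]                # copy taken before adaptation
adapted = effective_weights(base_w, lora_b, lora_a)  # adapter attached
restored = effective_weights(base_w)                 # adapter removed

print("adapted: ", adapted)
print("restored:", restored)
print("base untouched:", base_w == snapshot)
```

With the adapter attached, the effective weights differ and a different route can win. With it removed, the model is back on exactly the weights it started with, because they were never written to.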

This subsumes six previously-distinct phenomena under one mechanism:

Catastrophic forgetting in continual learning
Long-train mode collapse
Dementia retrieval-failure (preserved-but-unreachable knowledge)
Bjork desirable difficulties
Spaced repetition advantage (in biological substrates)
Bahrick's permastore retention plateau (Bahrick 1984 — long-term retention 3-5 years post-learning)

The design rule: "want less: dilute; want more: protect."

Bjork's desirable difficulties get a mechanistic foundation

Bjork (1994) argued that desirable difficulties (effortful retrieval, spacing, interleaving) produce better long-term retention than easy practice. The framework provides a mechanism: difficulty raises route-competition, which deepens the hysteresis trace, which is what gets retained.

The prediction is testable in artificial substrates: calibrated retrieval-practice should preserve the slope from recognising an answer to settling on it, while calibration-naive training (RLHF-style suppression of friction) should flatten it.

Expertise reversal effect

Kalyuga, Ayres, Chandler & Sweller (2003) found that instructional supports that help novices hurt experts. Worked examples accelerate beginner learning but slow expert performance, because experts have already encoded the pattern and the support now competes with their internal model.

The framework prediction: this should generalise to artificial substrates as a substrate-graded U-curve. Tested across three model sizes:

Qwen2-1.5B: flat at 4-6% — substrate too limited to show the curve
Qwen2.5-7B: monotone gain — novice tier
Llama-3.3-70B: classical U-curve — 73% → 50% → 61% — expert tier shows the reversal (Paper 4b in preparation)

The substrate-graded scope-condition is novel: the U-curve appears only above a capacity threshold; below that, the substrate cannot represent enough alternatives for the conflict to manifest.
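The three shapes named above can be told apart with a very small classifier. This is a sketch of the idea only: the function name, the 2-point flatness tolerance and the rule "lowest point strictly in the middle" are assumptions made here, not the analysis used in Paper 4b. Only the 73, 50, 61 series comes from the text; the other two series are placeholder values chosen to match the descriptions.

```python
def classify_curve(scores, flat_tolerance=2.0):
    """Label a series of accuracy scores as flat, u-curve, monotone gain, or other."""
    if len(scores) < 3:
        raise ValueError("need at least three points to see a curve")
    if max(scores) - min(scores) <= flat_tolerance:
        return "flat"
    lowest = scores.index(min(scores))
    if 0 < lowest < len(scores) - 1:
        return "u-curve"  # the dip sits between two higher points
    if all(later >= earlier for earlier, later in zip(scores, scores[1:])):
        return "monotone gain"
    return "other"

reported_70b = [73, 50, 61]       # from the text, in the order given
placeholder_flat = [4, 5, 6]      # invented values inside the stated 4-6% band
placeholder_gain = [20, 35, 50]   # invented values; only "monotone gain" is reported

for label, series in [("70B", reported_70b),
                      ("flat example", placeholder_flat),
                      ("gain example", placeholder_gain)]:
    print(f"{label}: {classify_curve(series)}")
```

A check like this makes the scope-condition explicit: the reversal only counts as present when there is a dip to measure, which a flat series cannot show.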

What language models cannot test

Several classical learning phenomena are structurally untestable on inference-time LLMs because the substrate lacks features the human version requires:

Spaced repetition — the mechanism requires remembering across sessions. LLMs do not. Testable only via fine-tuning weight drift over training cycles.
Ebbinghaus forgetting curve — same constraint. Requires retention measurement across time-separated sessions.
Cross-session interference — when new learning interferes with previously learned material across sessions. Requires session-to-session memory.

The pattern: between-session memory phenomena require fine-tuning experiments, not inference-time probing. This is a methodological constraint following directly from substrate features.

Implications

For educational science: Bjork's desirable difficulties get a mechanistic foundation. Difficulty is not arbitrary: it is whatever raises route-competition enough to leave a deep hysteresis trace. This predicts which interventions transfer (those that raise route-competition specifically) and which do not (those that just add cognitive load without competition).

For AI training: friction profile during training should predict retention. Calibrated retrieval-practice should preserve the slope from recognising an answer to settling on it; RLHF-style friction suppression should flatten it.

For clinical translation: signal-budget redistribution as a mechanism for retrieval failure is a hypothesis pursued in Paper 8c (forthcoming). The prediction is that parts of dementia may present as failure to commit despite preserved knowledge, the same friction-ceiling pattern observed in LLMs. Diagnostic implication: sub-threshold cuing tests should distinguish encoded-but-unreachable from unencoded. This is a testable prediction, not an established clinical finding.

For compliance and workplaces: the same mechanics explain why information-heavy compliance courses rarely change behaviour. More text does not build the route. I unpack this on the page "Compliance is behaviour, not information".

For Dunning-Kruger: the classic "most certain when you know least" curve falls straight out of the same mechanics. I show it, measured in real language models, on the page "Why 'knows little, believes a lot'".

For anyone who wants to change something: the same mechanics (build the route in small steps, lower the baseline pressure, avoid all-or-nothing) are gathered as a practical base map on the page "How change works", with concrete pages on getting out of an addiction, a child who can't go to school, and thoughts that go in circles.

If you want the picture underneath all of this, the page "What is a race?" starts with water: tanks, pipes, and the channels in the sand that learning leaves behind.

For the full technical version with specific statistics, paper references, and protocol details, see the page "learning (technical)".