Learning — what the framework predicts and finds
It is not a module in the brain. It is a track that gets carved when an answer wins
We tend to think of learning as something special, a storehouse in the brain where things get filed away. But that is not how it works. Learning is just what happens when several possible answers compete to win, and the winner leaves a track. Next time the track is already there, and the answer is easier to find. That is the whole mechanism. We can watch it directly in a language model, where the tracks are easy to measure. And it explains something strange: a system that keeps no trace of its own past cannot learn at all. The tracks are not a flaw. They are the very precondition.
Hysteresis is the precondition for learning
Hysteresis (a system carrying traces of its own history) has traditionally been treated as an error or side-effect to be minimised. The framework flips this: hysteresis is the structural precondition for learning in any bounded probabilistic substrate. In a substrate that bears no trace of its history, learning does not occur. Path-dependent state is what makes learning structurally possible.
This applies equally to:
- Biological brains — synaptic weights change because activity leaves a trace
- Artificial neural networks — weights update because the loss-trajectory is path-dependent
- Physical systems with memory — magnetisation, glasses, polymer dynamics all show learning-like adaptation
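To make the contrast concrete, here is a minimal toy sketch, not the framework's own model. Two routes race on each trial, and the winner's weight grows by a fixed amount. The names `run_trials` and `trace_gain`, the starting weights, and the update rule are all assumptions made for illustration. Setting `trace_gain` to zero gives a substrate that keeps no trace of its history.

```python
import random

def run_trials(n_trials, trace_gain, seed=0):
    """Race two routes n_trials times; the winner's weight grows by trace_gain.

    trace_gain = 0.0 models a substrate with no hysteresis: nothing the
    system does changes how the next race starts.
    """
    rng = random.Random(seed)
    weights = {"A": 1.0, "B": 1.0}   # hypothetical starting route strengths
    p_a_history = []
    for _ in range(n_trials):
        p_a = weights["A"] / (weights["A"] + weights["B"])
        p_a_history.append(p_a)       # chance that route A wins this race
        winner = "A" if rng.random() < p_a else "B"
        weights[winner] += trace_gain # the winner leaves a track
    return weights, p_a_history

memoryless_weights, memoryless_p = run_trials(201, trace_gain=0.0)
tracked_weights, tracked_p = run_trials(201, trace_gain=1.0)
print(memoryless_weights, tracked_weights)
```

In the memoryless run every race starts from the same 50/50 odds, so nothing is ever learned. In the run with a trace, the final weights depend on the order of early wins, which is the path-dependence the section describes.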
Why information-dumping doesn't teach (in models or in humans)
Here is where the framework gets uncomfortable.
Large language models are literal computers, designed from the ground up to absorb information. They have unlimited patience, perfect recall of whatever is shown to them in a session, no biological limit on attention. If any system could be taught by information-dumping, it would be them.
They can't.
Paper 2B shows this directly. Take 47 invented facts. Fine-tune a language model on them, with 100 epochs of training, with paraphrase augmentation, with everything we know how to throw at it. The result is a model that performs worse on cloze retrieval than one that just has those same 47 facts sitting in its prompt. Worse. After all that training. The cumulative gradient pressure compresses the calibrated distribution; the model commits with high confidence to whatever happened to win each route; and the alternatives (which is to say, the actual learning) get pressed below the noise floor.
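The compression step can be illustrated with a toy sketch. It is not the Paper 2B protocol: it assumes a single softmax over four candidate answers with made-up starting logits, and applies 100 plain cross-entropy gradient steps toward one of them. The function names and the learning rate are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)                       # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; lower means a more committed distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def train_on_one_answer(logits, target, epochs, lr=0.5):
    """Repeated cross-entropy gradient steps toward a single target index."""
    logits = list(logits)
    for _ in range(epochs):
        probs = softmax(logits)
        for i in range(len(logits)):
            grad = probs[i] - (1.0 if i == target else 0.0)  # dLoss/dlogit_i
            logits[i] -= lr * grad
    return softmax(logits)

start = [1.0, 0.8, 0.6, 0.2]   # hypothetical logits over four candidate answers
before = softmax(start)
after = train_on_one_answer(start, target=0, epochs=100)
print(round(entropy(before), 3), round(entropy(after), 3))
```

The entropy drops sharply and the probability mass on the alternatives shrinks toward zero. That is the sense in which repeated gradient pressure makes a distribution commit. The toy says nothing about retrieval accuracy itself.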
If language models, which are designed to be teachable, cannot be taught by information-dumping, the question becomes harder. Why have we assumed humans can be?
The sender-receiver mismatch
The standard model of teaching is: the sender packs information; the receiver receives information. If the receiver doesn't learn, the receiver isn't trying hard enough.
That model is physics-blind.
You do not learn information. You learn the trace information leaves. The trace is the physics. The trace is what gets cut into the substrate. The information itself is just the stimulus that produces, or fails to produce, the trace.
If the trace doesn't get cut, no amount of effort or instruction or motivation matters. You cannot store what you have not traced. It is not a question of will. It is a question of which routes got reinforced often enough, under the right conditions of competition and load, to leave a channel in the substrate. The water-on-the-tile-floor metaphor again: if you don't drag your finger through it, the channel does not form. Period.
What this implies for teaching
The pedagogical implication is not "throw more information at the student until they learn." It is "design conditions under which the student's substrate cuts the relevant trace." These are different problems. They feel similar from the sender's side, since both involve providing material. They diverge sharply at the substrate level: the first treats the receiver as a passive storage device; the second treats them as a hysteresis-bearing system that has to be made to do something with the material before any trace can form.
This is what Bjork's "desirable difficulties" actually means at the substrate level: difficulty is what raises route-competition enough that the trace gets cut deeply. Cloze tests cut shallow traces because they require little competition; application tests cut deep ones because they require composition under load. Spacing works because the trace gets re-activated and re-deepened. Interleaving works because it forces routes to compete instead of being pre-sorted.
None of this is motivation. All of it is physics. The teacher who blames "lazy students" is the equivalent of a programmer blaming the laws of thermodynamics for an inefficient algorithm.
Encoding-through-loading
The standard cognitive-science view treats encoding as a separate process from retrieval and decision. The framework collapses this: what gets encoded is what wins route-competition under load. There is no separate encoding module: the same race-resolution machinery that produces decisions also leaves the trace that constitutes learning.
This connects to two classical findings:
- Levels of processing (Craik & Lockhart 1972): deeper semantic processing produces stronger encoding. The framework explanation: deeper processing requires resolving more competing routes, leaving a richer hysteresis trace.
- Distinctiveness effect / von Restorff (1933): outliers are remembered better. Tested at the gradient level in fine-tuned LLMs: violations of an implicitly learned pattern produce stronger encoding than conforming instances.
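One way to read "at the gradient level" is through the standard softmax cross-entropy gradient with respect to the logits, which is the predicted distribution minus the one-hot target. The sketch below assumes that reading, since the measure actually used is not specified here. The probabilities are made up, and `logit_gradient` and `norm` are illustrative names.

```python
import math

def logit_gradient(probs, target):
    """Gradient of cross-entropy loss w.r.t. the logits: probs minus one-hot."""
    return [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]

def norm(vec):
    return math.sqrt(sum(v * v for v in vec))

# Hypothetical model that has learned to expect option 0 with 90% confidence.
predicted = [0.90, 0.05, 0.05]

conforming = norm(logit_gradient(predicted, target=0))  # pattern holds
violating = norm(logit_gradient(predicted, target=1))   # pattern is violated
print(conforming, violating)
```

Under this assumption, an instance that violates the expected pattern produces a much larger update signal than one that conforms to it, which matches the direction of the distinctiveness effect.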
Language models "learn" differently depending on task type
Two task types on the same knowledge, tested here on language models of different sizes, where "B parameters" means billions of weights in the neural network (the model's size; e.g., 8B means eight billion, roughly a mid-sized model):
- Cloze (recall) — "What is the capital of Denmark?" — saturates early. Most models reach ~90% accuracy by 8B parameters.
- Application (chaining facts) — "If the capital lies on Sealand, and you take a train west from there..." — scales monotonically from 2% (0.5B parameters) to 85% (70B — the size of Llama 70B).
Same knowledge, different load. Cloze is indexing-bound; application is composition-bound. The bottleneck migrates with capacity.
Implication for educational science: the same knowledge encoded at different capacity levels supports different task types. A learner who can do cloze cannot necessarily do application; the gap is not motivation, it is composition-bound computation.
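In practice this means scoring the two task types separately instead of reporting one pooled accuracy. The sketch below is a hypothetical scoring helper, not the protocol used for the figures above. The record format, the name `accuracy_by_task`, and the sample data are all invented for illustration.

```python
from collections import defaultdict

def accuracy_by_task(records):
    """Return {task_type: fraction correct} for (task_type, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for task_type, is_correct in records:
        totals[task_type] += 1
        correct[task_type] += int(is_correct)
    return {task: correct[task] / totals[task] for task in totals}

# Hypothetical graded answers from one learner on the same set of facts.
records = [
    ("cloze", True), ("cloze", True), ("cloze", True), ("cloze", False),
    ("application", True), ("application", False),
    ("application", False), ("application", False),
]

scores = accuracy_by_task(records)
gap = scores["cloze"] - scores["application"]
print(scores, gap)
```

A pooled score for this made-up learner would be 50%, which hides the fact that recall is mostly in place while composition is not.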
"Catastrophic forgetting" is signal redistribution, not damage
Catastrophic forgetting in fine-tuned LLMs has been interpreted as substrate damage: the base model "loses" knowledge during adaptation. This interpretation is empirically falsified.
The reverse-test (v13c, Paper 6 forthcoming): remove the LoRA adapter, and base performance returns to 100% of baseline, a 179.5% recovery relative to the adapter-degraded state. The base substrate is 100% intact; the adapter rebalances which routes win competition, but does not damage the underlying weights.
The mechanism is signal-budget redistribution: under fine-tuning, route-competition shifts toward the new task, away from the original. The original capability is preserved. It is just outranked. Removing the adapter restores the original ranking.
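The structural reason removal can restore the base is that a LoRA adapter is an additive low-rank update applied alongside frozen base weights. The toy below assumes the adapter is kept separate and never merged into the base. The 2x2 matrix, the rank-1 factors, and the helper names are illustrative values, not taken from the v13c experiment.

```python
def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def forward(base, x, adapter=None, scale=1.0):
    """Compute W x, plus scale * (B A) x when an adapter is attached."""
    y = matvec(base, x)
    if adapter is not None:
        b, a = adapter
        delta = matvec(matmul(b, a), x)
        y = [yi + scale * di for yi, di in zip(y, delta)]
    return y

def winner(scores):
    """Index of the route with the highest score."""
    return scores.index(max(scores))

base = [[2.0, 0.0], [0.0, 1.0]]           # frozen base weights (toy values)
adapter = ([[0.0], [3.0]], [[1.0, 0.0]])  # rank-1 update: B is 2x1, A is 1x2
x = [1.0, 0.5]

before = winner(forward(base, x))              # base model alone
adapted = winner(forward(base, x, adapter))    # adapter attached
after = winner(forward(base, x))               # adapter removed again
print(before, adapted, after)
```

With the adapter attached, route 1 outranks route 0; once it is removed, the original ranking returns exactly, because `base` was never modified.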
This subsumes six previously distinct phenomena under one mechanism:
- Catastrophic forgetting in continual learning
- Long-train mode collapse
- Dementia retrieval-failure (preserved-but-unreachable knowledge)
- Bjork desirable difficulties
- Spaced repetition advantage (in biological substrates)
- Bahrick's permastore retention plateau (Bahrick 1984: retention that stabilises around 3-5 years post-learning)
The design rule: "want less: dilute; want more: protect."
Bjork's desirable difficulties get a mechanistic foundation
Bjork (1994) argued that desirable difficulties (effortful retrieval, spacing, interleaving) produce better long-term retention than easy practice. The framework provides a mechanism: difficulty raises route-competition, which deepens the hysteresis trace, which is what gets retained.
The prediction is testable in artificial substrates: calibrated retrieval-practice should preserve the slope from recognising an answer to settling on it, while calibration-naive training (RLHF-style suppression of friction) should flatten it.
Expertise reversal effect
Kalyuga, Ayres, Chandler & Sweller (2003) found that instructional supports that help novices hurt experts. Worked examples accelerate beginner learning but slow expert performance, because experts have already encoded the pattern and the support now competes with their internal model.
The framework prediction: this should generalise to artificial substrates as a substrate-graded U-curve. Tested across three model sizes:
- Qwen2-1.5B: flat at 4-6% — substrate too limited to show the curve
- Qwen2.5-7B: monotone gain — novice tier
- Llama-3.3-70B: classical U-curve — 73% → 50% → 61% — expert tier shows the reversal (Paper 4b in preparation)
The substrate-graded scope-condition is novel: the U-curve appears only above a capacity threshold; below that, the substrate cannot represent enough alternatives for the conflict to manifest.
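The three outcome shapes above can be told apart with a simple rule. The classifier below is a sketch, not the analysis used in Paper 4b. The name `curve_shape` and the 3-point `tolerance` for calling a curve flat are assumptions, and only the 73 / 50 / 61 sequence comes from the text; the other inputs are hypothetical.

```python
def curve_shape(scores, tolerance=3.0):
    """Label a three-point accuracy curve: 'flat', 'monotone', 'U' or 'inverted-U'.

    scores are percentages in the order the conditions were reported;
    tolerance is the largest spread still treated as flat (an assumption).
    """
    first, middle, last = scores
    if max(scores) - min(scores) <= tolerance:
        return "flat"
    if first <= middle <= last or first >= middle >= last:
        return "monotone"
    return "U" if middle < min(first, last) else "inverted-U"

print(curve_shape([73, 50, 61]))   # the reported Llama-3.3-70B sequence
print(curve_shape([4, 6, 5]))      # hypothetical points inside the 4-6% band
print(curve_shape([20, 30, 45]))   # hypothetical monotone gain
```

A rule like this makes the scope-condition checkable: the prediction is that the "U" label should appear only above some capacity threshold.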
What language models cannot test
Several classical learning phenomena are structurally untestable on inference-time LLMs because the substrate lacks features the human version requires:
- Spaced repetition — the mechanism requires remembering across sessions. LLMs do not. Testable only via fine-tuning weight drift over training cycles.
- Ebbinghaus forgetting curve — same constraint. Requires retention measurement across time-separated sessions.
- Cross-session interference — when new learning interferes with previously learned material across sessions. Requires session-to-session memory.
The pattern: between-session memory phenomena require fine-tuning experiments, not inference-time probing. This is a methodological constraint following directly from substrate features.
Implications
For educational science: Bjork's desirable difficulties get a mechanistic foundation. Difficulty is not arbitrary: it is whatever raises route-competition enough to leave a deep hysteresis trace. This predicts which interventions transfer (those that raise route-competition specifically) and which do not (those that just add cognitive load without competition).
For AI training: friction profile during training should predict retention. Calibrated retrieval-practice should preserve the slope from recognising an answer to settling on it; RLHF-style friction suppression should flatten it.
For clinical translation: signal-budget redistribution as a mechanism for retrieval failure is a hypothesis pursued in Paper 8c (forthcoming). The prediction is that parts of dementia may present as failure to commit despite preserved knowledge, the same friction-ceiling pattern observed in LLMs. Diagnostic implication: sub-threshold cuing tests should distinguish encoded-but-unreachable from unencoded. This is a testable prediction, not an established clinical finding.
For compliance and workplaces: the same mechanics explain why information-heavy compliance courses rarely change behaviour. More text does not build the route. I unpack this on Compliance is behaviour, not information.
For Dunning-Kruger: the classic "most certain when you know least" curve falls straight out of the same mechanics. I show it, measured in real language models, on Why "knows little, believes a lot".
If you want the picture underneath all of this, What is a race? starts with water: tanks, pipes, and the channels in the sand that learning leaves behind.
For the full technical version with specific statistics, paper references, and protocol details, see learning (technical).