Learning — what the framework predicts and finds
From hysteresis as precondition to signal-budget redistribution
Novel scope-condition: the classical expertise-reversal effect (instructional supports help novices but hurt experts; Kalyuga) appears in LLMs only above a model-capacity threshold. Llama-3.3-70B shows the U-curve cleanly; smaller models cannot, because they lack the capacity to be in the "expert" regime where additional examples become interference. This is the first substrate-graded statement of expertise-reversal.
Learning is one of the central themes across the paper series. The framework's claim: learning is a direct consequence of competition under load, not a separate cognitive module. Where you have race architecture (parallel evaluation, bounded resources, irreversible commit), you get learning when commitment leaves a path-dependent trace. Where you do not, you do not.
On this page
- 1. Hysteresis as the precondition for learning
- 2. Encoding-through-loading: what gets encoded depends on what wins competition
- 3. Capacity scaling: cloze versus application asymmetry
- 4. "Catastrophic" forgetting is signal-budget redistribution, not damage
- 5. Calibrated retrieval-practice and Bjork's desirable difficulties
- 6. Expertise reversal effect — substrate-graded
- 7. What language models cannot test (and why)
- 8. Implications
1. Hysteresis as the precondition for learning
Hysteresis (path-dependent state retention) has long been treated as an error or side-effect to be minimised. The framework reframes it as the structural precondition for learning: in every case tested so far, a substrate that bears no trace of its own history does not learn. Path-dependent state is what makes learning structurally possible.
This applies across the bounded probabilistic substrates the framework examines:
- Biological brains — synaptic weights change because activity leaves a trace
- Artificial neural networks — weights update because the loss-trajectory is path-dependent
- Physical systems with memory — magnetisation, glasses, polymer dynamics show learning-like adaptation that may share the same race-structure, differing in substrate not in shape; a shared vocabulary, not a claim the substrates are identical
Hysteresis is empirically replicated cross-architecture in transformers (Cogito-671B, DeepSeek-V3, Qwen3-235B, Llama-3.3-70B) and in a State Space Model (LiquidAI LFM2). The cross-architecture replication is "the strongest evidence in our data that friction mechanics are a property of the race architecture, not of any specific computational implementation." The same conclusion follows for biological substrates.
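As a toy illustration of what "path-dependent state" means here (this is not taken from the papers), the sketch below contrasts a memoryless unit with a Schmitt-trigger-style unit whose output depends on its own previous state. The class names, thresholds and input sequences are all assumptions chosen for illustration.

```python
class MemorylessUnit:
    """Output depends only on the current input; no trace of history."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold

    def step(self, x):
        return 1 if x >= self.threshold else 0


class HystereticUnit:
    """Switching on needs x >= high; switching off needs x <= low."""

    def __init__(self, low=0.3, high=0.7):
        self.low = low
        self.high = high
        self.state = 0  # the retained trace of past inputs

    def step(self, x):
        if self.state == 0 and x >= self.high:
            self.state = 1
        elif self.state == 1 and x <= self.low:
            self.state = 0
        return self.state


def run(unit, inputs):
    """Feed a sequence of inputs and collect the outputs."""
    return [unit.step(x) for x in inputs]


# Two hypothetical histories that end on the same input value (0.5).
via_high = [0.0, 0.5, 0.8, 0.5]
via_low = [0.0, 0.5, 0.2, 0.5]

memoryless_final = (run(MemorylessUnit(), via_high)[-1],
                    run(MemorylessUnit(), via_low)[-1])
hysteretic_final = (run(HystereticUnit(), via_high)[-1],
                    run(HystereticUnit(), via_low)[-1])
print(memoryless_final, hysteretic_final)
```

The memoryless unit gives the same final output for both histories, while the hysteretic unit does not: its final state records which path was taken. That retained difference is the minimal sense of "trace" used in this section.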
Source: Paper 1 §2.3, §5.8.4
2. Encoding-through-loading
The standard cognitive-science view treats encoding as a separate process from retrieval and decision. The framework collapses this: what gets encoded is what wins route-competition under load. There is no separate encoding module: the same race-resolution machinery that produces decisions also leaves the trace that constitutes learning.
This connects to two classical findings:
- Levels of processing (Craik & Lockhart 1972; Craik & Tulving 1975): deeper semantic processing produces stronger encoding. The framework explanation: deeper processing requires resolving more competing routes, which leaves a richer hysteresis trace.
- Distinctiveness effect / Von Restorff (von Restorff 1933): outliers are remembered better. Tested at the gradient level in fine-tuned LLMs (Paper 4): violations of an implicitly learned pattern produce stronger encoding than conforming instances. Mechanism refined to distinctiveness-confidence — surprise-magnitude-proportionality at the gradient-trace level.
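One way to see why surprise could scale the trace at the gradient level: for a softmax output trained with cross-entropy, the gradient with respect to the logits is the predicted distribution minus the one-hot target, so a low-probability (surprising) target yields a larger gradient. The sketch below shows only this standard property. Whether Paper 4 measures exactly this quantity is an assumption, and the logits are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def surprisal(logits, target):
    """Negative log-probability assigned to the observed target."""
    return -math.log(softmax(logits)[target])


def logit_grad_norm(logits, target):
    """L2 norm of d(cross-entropy)/d(logits) = softmax(logits) - one_hot(target)."""
    probs = softmax(logits)
    grad = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
    return math.sqrt(sum(g * g for g in grad))


# Hypothetical model that has learned to expect token 0.
logits = [4.0, 0.0, 0.0]

conforming = logit_grad_norm(logits, target=0)  # pattern is followed
violating = logit_grad_norm(logits, target=1)   # pattern is violated

print(f"conforming: surprisal={surprisal(logits, 0):.3f}, grad norm={conforming:.3f}")
print(f"violating:  surprisal={surprisal(logits, 1):.3f}, grad norm={violating:.3f}")
```

In this toy case the violating instance has both the higher surprisal and the larger gradient norm, which is the direction the distinctiveness account requires.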
Source: Paper 1 §6.4 · Paper 2
3. Capacity scaling
Two task types on the same knowledge base differentiate by capacity:
- Cloze retrieval (recovering a fact as presented) saturates early. Most models reach ~90% accuracy by 8B parameters.
- Application (chaining multiple facts into a derivation) scales monotonically across roughly two orders of magnitude of model size, from 2% at 0.5B to 85% at 70B. Spearman ρ = +1.000 on the Qwen2.5 ladder; +40.8 percentage points per decade across the cross-family panel.
The bottleneck migrates with capacity. At 0.5B, retrieval fails. At 14B, retrieval is saturated, and 36% of failures show the "retrieval succeeds, derivation fails" pattern: the friction-ceiling signature at the encoding level.
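The two scaling statistics quoted above can be reproduced in a few lines. The sketch below is an illustration, not the papers' analysis code: it computes a percentage-points-per-decade slope from the two endpoints given in the text, and a Spearman rank correlation on a placeholder ladder. The function names are hypothetical, and the placeholder accuracies are not measured values.

```python
import math


def pp_per_decade(size_lo, acc_lo, size_hi, acc_hi):
    """Accuracy gain in percentage points per tenfold increase in parameters."""
    decades = math.log10(size_hi / size_lo)
    return (acc_hi - acc_lo) / decades


def ranks(values):
    """1-based ranks; assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(xs, ys):
    """Spearman rank correlation via the no-ties shortcut formula."""
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Endpoints quoted in the text: 2% at 0.5B and 85% at 70B.
endpoint_slope = pp_per_decade(0.5, 2.0, 70.0, 85.0)

# Placeholder ladder, NOT measured data: it stands in for any strictly
# increasing accuracy sequence, which is all that rho = +1 requires.
sizes_b = [0.5, 1.5, 3.0, 7.0, 14.0]
placeholder_acc = [1, 2, 3, 4, 5]
rho = spearman_rho(sizes_b, placeholder_acc)

print(round(endpoint_slope, 1), rho)
```

The two endpoints span about 2.1 decades and give roughly 38.7 percentage points per decade. The reported +40.8 is presumably a fit over the whole cross-family panel rather than an endpoint difference, which would account for the gap; that reading is an assumption.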
Implication for educational science: the same knowledge encoded at different capacity levels supports different task types. A learner who can do cloze cannot necessarily do application; the gap is not motivation, it is composition-bounded computation.
Source: Paper 2
4. "Catastrophic" forgetting is signal-budget redistribution
Catastrophic forgetting in fine-tuned LLMs has been interpreted as substrate damage: the claim that the base model "loses" knowledge during adaptation. This interpretation is empirically falsified.
The reverse-test (v13c, Paper 6 forthcoming): remove the LoRA adapter, and the base substrate recovers 179.5% of baseline performance. The base substrate is 100% intact; the adapter rebalances which routes win competition, but does not damage the underlying weights.
The mechanism is signal-budget redistribution: under fine-tuning, route-competition shifts toward the new task, away from the original. The original capability is preserved. It is just outranked. Removing the adapter restores the original ranking.
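A minimal sketch of the redistribution idea, assuming routes compete through a softmax and the adapter acts as an additive offset on top of frozen base scores (LoRA does add its update alongside frozen base weights, but reducing it to a per-route offset is a simplification made here). The route labels and numbers are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_shares(base_logits, adapter_delta=None):
    """Share of the softmax budget each route wins, with or without an adapter.

    The base logits are never modified; the adapter is only added on top.
    """
    if adapter_delta is None:
        adapter_delta = [0.0] * len(base_logits)
    return softmax([b + d for b, d in zip(base_logits, adapter_delta)])


# Hypothetical scores: route 0 = original capability, route 1 = new task.
base_logits = [2.0, 1.0]
adapter_delta = [0.0, 3.0]  # hypothetical adapter boosting the new-task route

before = route_shares(base_logits)                     # no adapter
with_adapter = route_shares(base_logits, adapter_delta)
after_removal = route_shares(base_logits)              # adapter removed

print([round(s, 3) for s in before])
print([round(s, 3) for s in with_adapter])
print([round(s, 3) for s in after_removal])
```

With the adapter attached, the original route loses most of its share without its own score changing; once the adapter is removed, the original ranking returns exactly. This matches the "outranked, not damaged" reading, though it does not by itself explain a recovery above baseline.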
This subsumes six previously-distinct phenomena under one mechanism:
- Catastrophic forgetting in continual learning
- Long-train mode collapse
- Dementia retrieval-failure (preserved-but-unreachable knowledge)
- Bjork's desirable difficulties
- Spaced repetition advantage (in biological substrates)
- The classic Ebbinghaus forgetting curve
Tomas's framing for the design rule: "want less — dilute; want more — protect."
Source: Paper 1 §5.8.4 (companion mechanism developed in Paper 6, forthcoming)
5. Calibrated retrieval-practice and Bjork's desirable difficulties
Bjork (1994) argued that desirable difficulties (effortful retrieval, spacing, interleaving) produce better long-term retention than easy practice. The framework provides a mechanism: difficulty raises route-competition, which deepens the hysteresis trace, which is what gets retained.
The prediction is testable in artificial substrates: calibrated retrieval-practice should preserve the recognition-to-commit slope, while calibration-naive training (RLHF-style suppression of friction) should flatten it.
Operational definition: the recognition-to-commit slope is the regression coefficient $\beta$ in $p_{\text{commit}}(\text{answer}) = \alpha + \beta \cdot p_{\text{recognition}}(\text{answer}) + \varepsilon$, where $p_{\text{recognition}}$ is the model's logprob on the correct answer under a low-stakes recognition prompt ("which of these is correct?") and $p_{\text{commit}}$ is the model's logprob on the correct answer under a high-stakes commit prompt ("answer in one word"). A well-calibrated model has $\beta \approx 1$: knowing-it predicts committing-to-it. RLHF-flattened models show $\beta < 0.5$: recognition decouples from commitment, which is the substrate-level signature of the friction-ceiling pattern.
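The slope β defined above can be estimated with ordinary least squares over per-item logprob pairs. The sketch below is one plausible way to compute it, not the papers' pipeline. The function name is hypothetical, and both data sets are synthetic, constructed so that one has β = 1 and the other β = 0.3.

```python
def ols_intercept_slope(x, y):
    """Least-squares fit of y = alpha + beta * x; returns (alpha, beta)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    return alpha, beta


# Synthetic recognition logprobs for five items (not measured data).
recognition = [-0.1, -0.5, -1.0, -2.0, -3.0]

# Coupled case: commit logprobs track recognition one-for-one (beta = 1).
commit_coupled = [r - 0.2 for r in recognition]

# Decoupled case: commit logprobs barely move with recognition (beta = 0.3).
commit_flat = [0.3 * r - 1.0 for r in recognition]

alpha_c, beta_c = ols_intercept_slope(recognition, commit_coupled)
alpha_f, beta_f = ols_intercept_slope(recognition, commit_flat)

print(f"coupled:   beta = {beta_c:.2f}")
print(f"decoupled: beta = {beta_f:.2f}")
```

In practice each pair would come from scoring the same item under the two prompt types; how items are sampled and whether logprobs are length-normalised are design choices the text does not specify.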
Paper 4 v10d tests this prediction in a 4-arm design with a pre-registered protocol:
- Passive arm: standard SFT exposure
- Surface arm: shallow retrieval (cloze-like)
- Deep arm: deep retrieval-practice (application-like)
- Calibrated retrieval-practice arm: difficulty calibrated to model's own friction profile
Outcome measures: ECE (calibration), slope (recognition-commit relation), and OOD defer-rate (whether the model knows when not to commit).
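For readers unfamiliar with the first and third measures, the sketch below shows a common equal-width-bin formulation of expected calibration error and a simple defer-rate. The bin count, the function names and the boolean representation of "deferred" are assumptions; the pre-registered protocol may define them differently.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean of |accuracy - mean confidence| over equal-width bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in last bin
        bins[index].append((conf, ok))
    ece = 0.0
    for items in bins:
        if not items:
            continue
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += (len(items) / n) * abs(accuracy - mean_conf)
    return ece


def defer_rate(deferred):
    """Fraction of out-of-distribution prompts on which the model declined to commit."""
    return sum(1 for d in deferred if d) / len(deferred)


# Synthetic example: four answers given at 0.9 confidence, three of them right.
example_ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9],
                                         [True, True, True, False])

# Synthetic example: the model defers on three of four OOD prompts.
example_defer = defer_rate([True, False, True, True])

print(round(example_ece, 3), example_defer)
```

Lower ECE means stated confidence matches observed accuracy more closely. A higher OOD defer-rate is only desirable if in-distribution deferral stays low, so the two would normally be reported together.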
Source: Paper 1 §5.8.7 · Paper 4 (forthcoming) · Paper 6 (forthcoming)
6. Expertise reversal effect
Kalyuga, Ayres, Chandler & Sweller (2003) found that instructional supports that help novices hurt experts. Worked examples accelerate beginner learning but slow expert performance, because experts have already encoded the pattern and the support now competes with their internal model.
The framework prediction: this should generalise to artificial substrates as a substrate-graded U-curve. Tested in Paper 4b Exp 1 across three model sizes:
- Qwen2-1.5B: flat at 4-6% across 0/1/3-shot ICL — substrate too limited to show the curve
- Qwen2.5-7B: monotone gain (+12pp at 1-shot) — novice tier, additional examples help
- Llama-3.3-70B: classical U-curve — 73% → 50% → 61% at 0/1/3-shot — expert tier shows the reversal
The expert tier (70B) shows the same expertise reversal pattern Kalyuga reported for human experts. The substrate-graded scope-condition is novel: the U-curve appears only above a capacity threshold; below that, the substrate cannot represent enough alternatives for the conflict to manifest.
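The three curve shapes described above can be told apart with a simple rule over a 0/1/3-shot accuracy triple. The sketch below is an illustrative classifier, not the analysis used in Paper 4b. The function name and the 3-point tolerance are assumptions. Only the Llama-3.3-70B triple comes from the text; the other two triples are placeholders chosen to fit the reported descriptions.

```python
def classify_shot_curve(acc_0, acc_1, acc_3, tol=3.0):
    """Label a 0/1/3-shot accuracy triple (percent).

    tol is a hypothetical noise band in percentage points.
    """
    spread = max(acc_0, acc_1, acc_3) - min(acc_0, acc_1, acc_3)
    if spread <= tol:
        return "flat"
    if acc_1 < acc_0 - tol and acc_3 > acc_1 + tol:
        return "u-curve"
    if acc_1 >= acc_0 and acc_3 >= acc_1:
        return "monotone gain"
    return "other"


# Reported in the text for Llama-3.3-70B: 73% -> 50% -> 61%.
llama_70b = classify_shot_curve(73, 50, 61)

# Placeholder triple inside the reported 4-6% band for Qwen2-1.5B.
small_model = classify_shot_curve(4, 6, 5)

# Placeholder triple with a +12pp gain at 1-shot; the baseline and 3-shot
# values are invented for illustration only.
mid_model = classify_shot_curve(40, 52, 55)

print(llama_70b, small_model, mid_model)
```

A real analysis would need confidence intervals on each accuracy before calling a dip a reversal; the fixed tolerance here only stands in for that step.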
Source: Paper 1 §5.8.7 · Paper 3 §5.4 · Paper 4b (forthcoming)
7. What language models cannot test
Several classical learning phenomena are structurally untestable on inference-time LLMs because the substrate lacks features the human version requires:
The pattern: between-session memory phenomena require fine-tuning experiments, not inference-time probing. This is a methodological constraint following directly from substrate features.
Source: Paper 1 §9.4 (future work)
8. Implications
For educational science: Bjork's desirable difficulties get a mechanistic foundation. Difficulty is not arbitrary: it is whatever raises route-competition enough to leave a deep hysteresis trace. This predicts which interventions transfer (those that raise route-competition specifically) and which do not (those that just add cognitive load without competition).
For AI training: friction profile during training should predict retention. Calibrated retrieval-practice should preserve recognition-commit slope; RLHF-style friction suppression should flatten it. Paper 4 v10d tests this directly.
For cognitive science: a bridge between human and artificial learning. Phenomena previously studied in humans (Bjork, Von Restorff, Craik & Tulving, expertise reversal) become measurable in substrates where the friction signal is observable. Cross-substrate validation becomes possible.
For clinical translation: signal-budget redistribution as a mechanism for retrieval failure (Paper 8c forthcoming). Dementia presents as failure to commit despite preserved knowledge: the same friction-ceiling pattern observed in LLMs. Diagnostic implication: sub-threshold cuing tests should distinguish encoded-but-unreachable from unencoded.
Related pages
- Findings & new explanations — broader list of empirical discoveries and reframings
- Cross-substrate phenomena — what humans and LLMs share, and where they diverge
- Paper 2 (Capacity Scaling) — the encoding-side empirical paper
- Paper 1 (Friction Theory) — the substrate-universal foundation