In-Context Learning as Working Memory, Fine-Tuning as Long-Term Memory: A Substrate-Mechanistic Account of LLM Practice
Pødenphant Lund, T. (2026p) · Preprint · Live on Zenodo
In the tested setting (LoRA fine-tuning, Qwen2.5, invented-domain corpus), fine-tuning structurally compresses an LLM's calibrated distribution while ICL/RAG preserves it. The striking result for practitioners: models fine-tuned on this corpus score below the no-context baseline on application accuracy, so fine-tuning can leave the model worse off than having no knowledge at all. The architecture-level distinction maps onto working memory versus long-term memory and offers a substrate-level reading of the RAG-vs-fine-tuning debate; generality to other FT methods, sizes, and domains is hypothesised, not established. Headline numbers: cloze gap 16–28 pp, log(CRpos0) rising from 5.46 to 21.12, position-0 entropy falling to ≈0.
| Field | Value |
|---|---|
| DOI (concept) | 10.5281/zenodo.20145218 |
| Target venue | TMLR (primary) / ICLR / NeurIPS / COLM |
| Status | Preprint live 2026-05-12 |
| Length | ~10–12k words |
| Author | Tomas Pødenphant Lund [ORCID] |
TL;DR
Why do fine-tuned language models hallucinate more confidently than their in-context-learning (or RAG-equipped) counterparts on the same knowledge? The standard framing treats in-context learning (ICL) and fine-tuning (FT) as alternatives along a deployment-cost axis, but the two are not interchangeable. The substrate-level distinction drawn in this paper also offers a principled reading of the RAG-vs-fine-tuning debate: RAG operates in ICL-mode and preserves calibration, whereas FT compresses it as a structural consequence.
Mechanism. Each backward pass under cross-entropy loss amplifies the winning route asymmetrically and presses alternatives below the noise floor; the depth of this compression scales with how many gradient passes the substrate has absorbed. ICL preserves the model's calibrated distribution over candidate answers because no weight update has compressed it. FT compresses it as a structural consequence of cumulative gradient pressure, regardless of training-data content. The paper argues that this extends a previously RLHF-specific finding (Paper 1 §3) to weight-update training in general, including plain LoRA fine-tuning on innocuous factual data; the experiments reported here test the LoRA case only.
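The asymmetry described above can be illustrated with a toy model. The sketch below is not the paper's experiment: it assumes a four-candidate distribution whose logits are free parameters and applies plain gradient steps on cross-entropy toward a fixed target. The only fact it relies on is the standard softmax cross-entropy gradient, dL/dz_k = p_k - 1[k = target]; the starting logits, learning rate, and function names are illustrative choices.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def log_top2_ratio(probs):
    """Natural-log ratio between the two largest probabilities."""
    top1, top2 = sorted(probs, reverse=True)[:2]
    return math.log(top1) - math.log(top2)

def sgd_on_logits(logits, target, steps, lr=1.0):
    """Apply repeated cross-entropy gradient steps directly to the logits.

    Returns a list of (step, log top-2 ratio, entropy) measured before each update.
    """
    z = list(logits)
    history = []
    for step in range(steps + 1):
        p = softmax(z)
        history.append((step, log_top2_ratio(p), entropy(p)))
        # Gradient of cross-entropy w.r.t. logit k is p_k - 1[k == target]:
        # the target logit rises by lr * (1 - p_target), every other logit falls by lr * p_k.
        z = [zk - lr * (pk - (1.0 if k == target else 0.0))
             for k, (zk, pk) in enumerate(zip(z, p))]
    return history

# Illustrative starting point: the target (index 0) leads, but alternatives are live.
history = sgd_on_logits([2.0, 1.5, 1.0, 0.5], target=0, steps=200)
for step, ratio, h in history[::50]:
    print(f"step {step:3d}  log top-2 ratio {ratio:6.2f}  entropy {h:.4f}")
```

In this toy setting the log ratio grows with every step and the entropy shrinks toward zero, which matches the direction of the reported effect. It says nothing about magnitudes in a real transformer, where the logits are produced by shared weights rather than updated directly.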
Cognitive-science mapping. The architecture-level distinction between ICL and FT instantiates the two-systems memory architecture documented across cognitive science. The mapping is structural, not metaphorical: both pairs realise the same underlying contrast between knowledge maintained by ongoing computation (working memory / ICL) and knowledge consolidated as substrate change (long-term memory / FT). The trade-offs each pair displays follow from the architectural choice, not from the implementation:
| Property | Working memory | Long-term memory | ICL | FT (LoRA) |
|---|---|---|---|---|
| Substrate change | None (transient activation) | Yes (synaptic / structural) | None (forward pass only) | Yes (weight update) |
| Alternative routes | Accessible, calibrated | Compressed during consolidation | Preserved in logprobs | Compressed below noise floor |
| Cost per use | High (ongoing maintenance) | Low (cheap retrieval) | High (context tokens) | Low (no context required) |
| Capacity | Limited (~4–7 elements) | Effectively unlimited | Limited (context window) | Effectively unlimited |
| Decay | Yes (without rehearsal) | Robust (once consolidated) | Per-session loss | Persistent in weights |
| Subjective certainty | Tracks uncertainty | "Felt certainty" of recall | Tracks logprob entropy | Decoupled from accuracy |
| Flexibility | High (reconfigurable per task) | Low (encoded routes stable) | High (re-prompt anything) | Low (locked to training) |
Specific cognitive-science precedents this mirrors:
- Atkinson & Shiffrin (1968); Baddeley (1986): the multi-store architecture (sensory register → working memory → long-term memory), defined by what gets consolidated and what does not. Working memory's defining property is precisely that nothing has been compressed yet.
- Complementary Learning Systems account of consolidation (McClelland, McNaughton & O'Reilly 1995): consolidation as cumulative weight pressure moving knowledge from fast, flexible hippocampal indexing to slow, compressed neocortical schemas. The model predicts the compression of alternatives observed here in LoRA-FT, with depth of compression scaling with cumulative passes.
- Tulving (1972) episodic vs semantic memory — the loss of episode-specific calibration during semanticisation matches the compression of route-specific competing-routes signal during gradient training. Semantic memory "knows" without remembering the alternatives that were once considered.
- Sweller (1988) cognitive load theory — working-memory capacity bounds element-interactivity in the same way the context window bounds ICL composition. Schemas in LTM compress element-interactivity by pre-binding the components — structurally analogous to what gradient passes do to candidate routes.
- Bartlett (1932); Hassabis & Maguire (2009) — long-term memory does not store raw inputs but compressed schemas; alternatives are abstracted away during consolidation. Same outcome in LoRA-FT: the training data is not stored verbatim but as a route-amplification pattern with alternatives pressed out.
- Procedural memory (Squire 1986) — fluent performance with opaque introspection. Experts can perform without articulating why; the alternatives that would have justified the choice have been compressed away during overlearning. This is the substrate signature of confident hallucination in FT-trained models: fluent commitment without accessible alternatives.
- Reconsolidation (Nader, Schafe & LeDoux 2000) — each retrieval briefly destabilises the trace, opening a window during which the compression can be updated. The analogue for FT-trained models: re-exposure under fresh gradient passes can re-shape the compression, but does not re-open the calibration signal unless the gradient-pressure regime changes.
The claim is not that LLMs and brains share mechanisms; they do not. The claim is that the same architectural choice (consolidating via substrate change versus maintaining via ongoing computation) produces the same set of trade-offs in either substrate. Working memory and long-term memory are two computational regimes in the same brain; ICL and FT are two computational regimes on the same transformer. The trade-offs are isomorphic because the architectural choice is the same.
Empirical findings
Three experiments on a 47-fact invented knowledge domain (Zorbetik), Qwen2.5 base models at 3B and 7B scales, LoRA fine-tuning budgets from 5 to 100 epochs plus a paraphrase-augmented variant:
- Cloze retrieval gap: ICL outperforms LoRA-FT by 16–28 percentage points across capacity scales
- Application degradation: fine-tuned models score below the no-context baseline on application accuracy (for example ~5% vs ~12% at 3B after 30 epochs)
- Competing-routes collapse: log(CRpos0) climbs from 5.46 (ICL) to 17.85 (raw FT, 30 epochs) to 21.12 (paraphrase-augmented FT, 30 epochs), rising monotonically with cumulative gradient passes. Definition: CRpos0 = ptop1 / ptop2 at the first answer token, so log(CRpos0) = log(ptop1) - log(ptop2) in natural-log units. Large positive values mean the top candidate dominates the runner-up by many natural-log units, i.e. the distribution has collapsed to a single route. ICL values around 5 indicate that the model still assigns non-negligible probability to alternatives; FT values above 17 indicate that effectively no alternative is considered.
- Entropy collapse: position-0 entropy drops from 0.32 (ICL) to ~0.05 after 5 epochs of FT and ≈0.00 at 30 epochs, independent of training-data variation
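To make the reported magnitudes concrete, the sketch below converts a natural-log ratio between the two most likely first-answer tokens, log(ptop1 / ptop2), into odds and into an upper bound on the runner-up's probability. The bound follows from ptop1 + ptop2 ≤ 1. The function names are hypothetical and the example log-probabilities are illustrative; only the three reported values (5.46, 17.85, 21.12) come from the text.

```python
import math

def log_cr_pos0(logprobs):
    """Natural-log ratio between the two most likely first-answer tokens.

    `logprobs` is a list of natural-log probabilities at answer position 0.
    """
    top1, top2 = sorted(logprobs, reverse=True)[:2]
    return top1 - top2

def runner_up_upper_bound(log_cr):
    """Largest probability the runner-up can hold given the log ratio.

    With p1 = exp(log_cr) * p2 and p1 + p2 <= 1, p2 <= 1 / (1 + exp(log_cr)).
    """
    return 1.0 / (1.0 + math.exp(log_cr))

# Illustrative distribution (not from the paper): one strong candidate, two weak ones.
example = [math.log(0.90), math.log(0.05), math.log(0.05)]
print(f"example log ratio: {log_cr_pos0(example):.2f}")

# Values reported in the findings above.
reported = {
    "ICL": 5.46,
    "raw FT, 30 epochs": 17.85,
    "paraphrase-augmented FT, 30 epochs": 21.12,
}
for label, value in reported.items():
    odds = math.exp(value)
    bound = runner_up_upper_bound(value)
    print(f"{label}: top-1 is ~{odds:.3g}x the runner-up; runner-up <= {bound:.3g}")
```

Read this way, the ICL value corresponds to a runner-up that is still a few tenths of a percent likely, while the FT values push the runner-up to a probability many orders of magnitude smaller.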
Full per-condition results
| Condition | Size | Cloze acc. | Application acc. | log(CRpos0) | Hpos0 |
|---|---|---|---|---|---|
| No-context baseline | 3B / 7B | ~3% / ~5% | ~12% / ~18% | — | — |
| ICL (facts in prompt) | 3B | ~62% | ~28% | 5.46 | 0.32 |
| ICL (facts in prompt) | 7B | ~74% | ~41% | 5.46 | 0.32 |
| LoRA-FT 5 epochs | 3B | ~28% | ~6% | ~10 | ~0.05 |
| LoRA-FT 30 epochs | 3B | ~38% | ~5% | 17.85 | ≈0.00 |
| LoRA-FT 30 epochs | 7B | ~58% | ~9% | 17.85 | ≈0.00 |
| Paraphrase-aug FT | 7B | ~62% | ~14% | 21.12 | ≈0.00 |
Reading the table: the ICL rows preserve calibration (high entropy, moderate CR); the FT rows collapse the distribution (entropy → 0, CR rising monotonically with gradient passes). Application accuracy under FT is below the no-context baseline at both sizes; this is the application-degradation result.
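The application-degradation claim can be checked directly against the table. The sketch below transcribes the approximate application accuracies (the "~" values, in percent) and computes each condition's difference from the no-context baseline at the same model size. The variable and function names are illustrative, and because the inputs are rounded the deltas are approximate too.

```python
# Approximate application accuracies (percent), transcribed from the table above.
NO_CONTEXT_APPLICATION = {"3B": 12, "7B": 18}

CONDITIONS = [
    ("ICL (facts in prompt)", "3B", 28),
    ("ICL (facts in prompt)", "7B", 41),
    ("LoRA-FT 5 epochs", "3B", 6),
    ("LoRA-FT 30 epochs", "3B", 5),
    ("LoRA-FT 30 epochs", "7B", 9),
    ("Paraphrase-aug FT", "7B", 14),
]

def application_delta(size, accuracy):
    """Percentage-point difference from the no-context baseline at the same size."""
    return accuracy - NO_CONTEXT_APPLICATION[size]

deltas = {(name, size): application_delta(size, acc) for name, size, acc in CONDITIONS}

for (name, size), delta in deltas.items():
    print(f"{name:<22} {size}: {delta:+d} pp vs no-context")
```

Every FT row comes out negative and every ICL row positive, which is the pattern the note describes.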
Applied consequences
- Confident hallucination: FT-trained models hallucinate confidently because alternative routes have been compressed beyond reach — the uncertainty signal that would flag them is gone
- Agentic uncertainty: agentic systems cannot reliably represent uncertainty when built on FT-only substrates; the substrate-level signal that would carry it has been compressed away
- RAG vs FT at the substrate level: RAG operates in ICL-mode and preserves calibration; FT compresses it as a structural consequence
- Long-context agentic conversations (Claude Code, Cursor, multi-turn agents): inherit ICL's calibration properties for free, since each turn re-evaluates the full context with no weight update
- Hybrid memory architectures: concrete ICL+FT compositions can approximate biological memory's two-system structure with bounded context-window cost
Caveats (scope conditions)
FT is tested as LoRA only (full-parameter FT is untested), with a single random seed, two model sizes (3B, 7B), and one invented domain. Paraphrase augmentation in Experiment 3 confounds data variation with cumulative gradient pressure (≈10× more gradient steps); a clean compute-matched comparison is left to follow-up. Within these scope conditions, the qualitative ICL-vs-FT direction is the same in every tested condition.
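The confound can be made concrete with a step count. The sketch below is a back-of-envelope calculation, not the paper's training configuration: the 47 facts and 30 epochs come from the text, while the batch size and the number of paraphrases per fact are hypothetical values chosen so that the augmented run lands near the stated ≈10× gradient steps. It also ignores gradient accumulation.

```python
import math

def gradient_steps(num_examples, epochs, batch_size):
    """Optimizer steps for a run, assuming one step per (possibly partial) batch."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return epochs * steps_per_epoch

def matched_epochs(target_steps, num_examples, batch_size):
    """Epoch count that brings a run closest to `target_steps` (at least one epoch)."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return max(1, round(target_steps / steps_per_epoch))

NUM_FACTS = 47             # from the text
EPOCHS = 30                # from the text
BATCH_SIZE = 8             # hypothetical
PARAPHRASES_PER_FACT = 10  # hypothetical, chosen to reproduce the ~10x figure

raw_steps = gradient_steps(NUM_FACTS, EPOCHS, BATCH_SIZE)
aug_examples = NUM_FACTS * PARAPHRASES_PER_FACT
aug_steps = gradient_steps(aug_examples, EPOCHS, BATCH_SIZE)
ratio = aug_steps / raw_steps

# A compute-matched augmented run trains for fewer epochs on the larger corpus.
aug_epochs_matched = matched_epochs(raw_steps, aug_examples, BATCH_SIZE)
aug_steps_matched = gradient_steps(aug_examples, aug_epochs_matched, BATCH_SIZE)

print(f"raw FT: {raw_steps} steps; augmented FT: {aug_steps} steps ({ratio:.1f}x)")
print(f"compute-matched augmented FT: {aug_epochs_matched} epochs, {aug_steps_matched} steps")
```

Under these assumptions a compute-matched comparison would train the augmented corpus for roughly a tenth as many epochs, so any remaining difference in compression could be attributed to data variation rather than to step count.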
Why this paper exists
Paper 1 establishes that RLHF compresses the substrate's competing-routes signal, a special case of a structural property of backward-pass training. This paper isolates that property in its simplest form: plain LoRA fine-tuning on innocuous factual data produces the same compression, with the depth tracking cumulative gradient passes rather than the content or quality of the training data. The standard ICL-vs-FT framing as a deployment-cost trade-off therefore misses the architectural distinction. ICL and FT instantiate two different memory regimes on the same substrate.
Companion papers
- Paper 0 (BFT) — the biological grounding; working-memory / long-term-memory distinction at the four-fields level
- Paper 1 (Friction Theory) — substrate foundation; §3 RLHF-paradox is the special case this paper generalises
- Paper 2 (Capacity Scaling) — the empirical companion; same Zorbetik domain and Qwen2.5 ladder
- Paper 3 (Friction-Guided Inference) — uses the calibrated competing-routes signal that ICL preserves and FT compresses
- Paper 13 (Operational FT) — the operational mechanism (race-opening, recursive resolution, manifested behaviour, thermodynamic termination) that produces the FT compression here
Data and code
Per-token logprob datasets, fine-tuning notebooks, and analysis scripts share Paper 2's companion repository: github.com/tplund/friction-theory-p2-capacity-scaling (CC BY 4.0).