In-Context Learning as Working Memory, Fine-Tuning as Long-Term Memory: A Substrate-Mechanistic Account of LLM Practice

Pødenphant Lund, T. (2026p) · Preprint · Live on Zenodo

In the tested setting (LoRA-FT, Qwen2.5, invented-domain corpus), fine-tuning structurally compresses an LLM’s calibrated distribution while ICL/RAG preserves it. Striking practitioner result: models fine-tuned on this corpus score below the no-context baseline on application accuracy, so FT can leave the model worse off than having no knowledge at all. The architecture-level distinction maps onto working memory / long-term memory and offers a substrate-level reading of the RAG-vs-fine-tuning debate; generality to other FT methods, sizes, and domains is hypothesised, not established. Key numbers: cloze gap 16–28 pp; log(CRpos0) rises from 5.46 to 21.12; entropy → 0.

DOI (concept): 10.5281/zenodo.20145218
Target venue: TMLR (primary) / ICLR / NeurIPS / COLM
Status: Preprint live 2026-05-12
Length: ~10–12k words
Author: Tomas Pødenphant Lund

TL;DR

Why do fine-tuned language models hallucinate more confidently than their in-context-learning (or RAG-equipped) counterparts on the same knowledge? The standard framing treats in-context learning (ICL) and fine-tuning (FT) as alternatives along a deployment-cost axis. They are not interchangeable. The substrate-level distinction this paper establishes also offers a principled resolution of the RAG-vs-fine-tuning debate: RAG operates in ICL-mode and preserves calibration; FT compresses it as a structural consequence.

Mechanism. Each backward pass under cross-entropy loss amplifies the winning route asymmetrically and presses alternatives below the noise floor; the depth of this compression scales with how many gradient passes the substrate has absorbed. ICL preserves the model's calibrated distribution over candidate answers because no weight update has compressed it. FT compresses it as a structural consequence of cumulative gradient pressure, regardless of training-data content. The mechanism generalises a previously RLHF-specific finding (Paper 1 §3) to all weight-update training, including plain LoRA fine-tuning on innocuous factual data.
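
The direction of this effect can be illustrated with a toy softmax, offered here only as a minimal sketch and not as the paper's experimental code. It assumes a single four-way choice, plain gradient descent on cross-entropy, and hypothetical starting logits and learning rate; it says nothing about LoRA or transformer internals. It shows only that repeated cross-entropy updates toward one target push the entropy of the output distribution toward zero as the number of passes grows.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def train(logits, target, steps, lr=0.5):
    """Apply `steps` gradient-descent updates of cross-entropy loss on the logits.

    For softmax + cross-entropy, d(loss)/d(logit_i) = p_i - onehot_i.
    """
    logits = list(logits)
    for _ in range(steps):
        probs = softmax(logits)
        for i in range(len(logits)):
            grad = probs[i] - (1.0 if i == target else 0.0)
            logits[i] -= lr * grad
    return logits

# Hypothetical starting point: four candidate answers with nearby logits.
initial_logits = [1.0, 0.8, 0.5, 0.2]

entropy_by_steps = {
    steps: entropy(softmax(train(initial_logits, target=0, steps=steps)))
    for steps in (0, 5, 30, 100)
}

for steps, h in entropy_by_steps.items():
    print(f"{steps:>3} gradient passes: entropy = {h:.4f} nats")
```

In this toy the alternatives never reach probability zero, but their mass shrinks with every pass, which is the qualitative pattern the paragraph above describes.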

Cognitive-science mapping. The architecture-level distinction between ICL and FT instantiates the two-systems memory architecture documented across cognitive science. The mapping is structural, not metaphorical: both pairs realise the same underlying contrast between knowledge maintained by ongoing computation (working memory / ICL) and knowledge consolidated as substrate change (long-term memory / FT). The trade-offs each pair displays follow from the architectural choice, not from the implementation:

Property | Working memory | Long-term memory | ICL | FT (LoRA)
Substrate change | None (transient activation) | Yes (synaptic / structural) | None (forward pass only) | Yes (weight update)
Alternative routes | Accessible, calibrated | Compressed during consolidation | Preserved in logprobs | Compressed below noise floor
Cost per use | High (ongoing maintenance) | Low (cheap retrieval) | High (context tokens) | Low (no context required)
Capacity | Limited (~4–7 elements) | Effectively unlimited | Limited (context window) | Effectively unlimited
Decay | Yes (without rehearsal) | Robust (once consolidated) | Per-session loss | Persistent in weights
Subjective certainty | Tracks uncertainty | "Felt certainty" of recall | Tracks logprob entropy | Decoupled from accuracy
Flexibility | High (reconfigurable per task) | Low (encoded routes stable) | High (re-prompt anything) | Low (locked to training)

Specific cognitive-science precedents this mirrors:

The claim is not that LLMs and brains share mechanisms; they do not. The claim is that the same architectural choice, consolidate via substrate change versus maintain via ongoing computation, produces the same set of trade-offs in either substrate. Working memory and long-term memory are two computational regimes on the same brain; ICL and FT are two computational regimes on the same transformer. The trade-offs are isomorphic because the architectural choice is the same.

Empirical findings

Three experiments on a 47-fact invented knowledge domain (Zorbetik), Qwen2.5 base models at 3B and 7B scales, LoRA fine-tuning budgets from 5 to 100 epochs plus a paraphrase-augmented variant:

Full per-condition results

Condition | Size | Cloze acc. | Application acc. | log(CRpos0) | Hpos0
No-context baseline | 3B / 7B | ~3% / ~5% | ~12% / ~18% | |
ICL (facts in prompt) | 3B | ~62% | ~28% | 5.46 | 0.32
ICL (facts in prompt) | 7B | ~74% | ~41% | 5.46 | 0.32
LoRA-FT 5 epochs | 3B | ~28% | ~6% | ~10 | ~0.05
LoRA-FT 30 epochs | 3B | ~38% | ~5% | 17.85 | ≈0.00
LoRA-FT 30 epochs | 7B | ~58% | ~9% | 17.85 | ≈0.00
Paraphrase-aug FT | 7B | ~62% | ~14% | 21.12 | ≈0.00

ICL rows preserve calibration (high entropy, moderate CR). FT rows collapse the distribution (entropy → 0, CR rises monotonically with gradient passes). Note: application accuracy under FT is worse than the no-context baseline at both sizes. This is the application-degradation result.
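
For readers who want to reproduce an entropy figure like the Hpos0 column from per-token logprobs, the following is a minimal sketch. It assumes that Hpos0 denotes the Shannon entropy of the next-token distribution at the first answer position, and that only a top-k slice of logprobs is available, so the slice is renormalised. The page does not state the entropy units or the exact definition, and the example logprob values below are invented for illustration, not taken from the paper's data.

```python
import math

def position_entropy(logprobs):
    """Shannon entropy (nats) of a distribution given as log-probabilities.

    The input may be a truncated top-k slice, so probabilities are
    renormalised to sum to 1 before the entropy is computed.
    """
    probs = [math.exp(lp) for lp in logprobs]
    total = sum(probs)
    probs = [p / total for p in probs]
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Hypothetical first-position logprobs (illustrative values only).
spread_logprobs = [math.log(p) for p in (0.90, 0.05, 0.03, 0.02)]
peaked_logprobs = [math.log(p) for p in (0.9999, 0.0001)]

spread_entropy = position_entropy(spread_logprobs)
peaked_entropy = position_entropy(peaked_logprobs)

print(f"spread distribution: H = {spread_entropy:.3f} nats")
print(f"peaked distribution: H = {peaked_entropy:.4f} nats")
```

A distribution that still carries alternatives keeps a clearly non-zero entropy, while one that puts almost all mass on a single token rounds to 0.00 at two decimals.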

Applied consequences

Caveats (scope conditions)

FT is tested as LoRA only (full-parameter FT untested); single random seed; two model sizes (3B, 7B); one invented domain. Paraphrase-augmentation in Experiment 3 confounds data-variation with cumulative gradient-pressure (≈10× more gradient steps); a clean compute-matched comparison is left to follow-up. The qualitative ICL-vs-FT direction is robust within these scope conditions.
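
One way to set up the compute-matched comparison mentioned above is to fix the total number of gradient steps and derive the epoch count for the augmented corpus from it. The sketch below uses the 47 facts and the 30-epoch budget from this page; the batch size and the number of paraphrases per fact are hypothetical placeholders, and gradient accumulation and per-example token counts are ignored.

```python
import math

def gradient_steps(num_examples, epochs, batch_size):
    """Total optimizer steps, assuming one step per batch and no accumulation."""
    return math.ceil(num_examples / batch_size) * epochs

def matched_epochs(base_examples, base_epochs, aug_examples, batch_size):
    """Epochs on the augmented corpus that best match the baseline step count."""
    base_steps = gradient_steps(base_examples, base_epochs, batch_size)
    steps_per_epoch = math.ceil(aug_examples / batch_size)
    return max(1, round(base_steps / steps_per_epoch))

NUM_FACTS = 47             # from the experimental description
BASE_EPOCHS = 30           # one of the reported LoRA-FT budgets
BATCH_SIZE = 8             # hypothetical
PARAPHRASES_PER_FACT = 10  # hypothetical

augmented_examples = NUM_FACTS * PARAPHRASES_PER_FACT
base_steps = gradient_steps(NUM_FACTS, BASE_EPOCHS, BATCH_SIZE)
aug_epochs = matched_epochs(NUM_FACTS, BASE_EPOCHS, augmented_examples, BATCH_SIZE)
aug_steps = gradient_steps(augmented_examples, aug_epochs, BATCH_SIZE)

print(f"baseline: {base_steps} steps over {BASE_EPOCHS} epochs")
print(f"augmented: {aug_steps} steps over {aug_epochs} epochs")
```

Holding steps roughly equal in this way would separate the effect of data variation from the effect of cumulative gradient pressure, at the cost of showing each paraphrase far fewer times.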

Why this paper exists

Paper 1 establishes that RLHF compresses the substrate's competing-routes signal, a special case of a structural property of backward-pass training. This paper isolates that property in its simplest form: plain LoRA fine-tuning on innocuous factual data produces the same compression, with the depth tracking cumulative gradient passes rather than the content or quality of the training data. The standard ICL-vs-FT framing as a deployment-cost trade-off therefore misses the architectural distinction. ICL and FT instantiate two different memory regimes on the same substrate.

Companion papers

Cite

Pødenphant Lund, T. (2026p). In-Context Learning as Working Memory, Fine-Tuning as Long-Term Memory: A Substrate-Mechanistic Account of LLM Practice [Preprint]. Zenodo. https://doi.org/10.5281/zenodo.20145218

Data and code

Per-token logprob datasets, fine-tuning notebooks, and analysis scripts share Paper 2's companion repository: github.com/tplund/friction-theory-p2-capacity-scaling (CC BY 4.0).