In-context learning as working memory, fine-tuning as long-term memory

Paper 2B · Pødenphant Lund (2026p) · Read on Zenodo

I study language models to understand people. Fine-tuned language models give more confident wrong answers than RAG-based ones on the same knowledge, and the reason is the same physics that makes your long-term memory feel more certain than your working memory. A surprising consequence: in this experiment, fine-tuning made the model worse at applying knowledge than if it had never been given the knowledge at all. The RAG-vs-fine-tuning debate has a substrate-level answer.

What is this about?

There are two standard ways to give a language model new knowledge. You can keep it in the prompt. That is in-context learning (ICL); RAG (retrieval-augmented generation) is the most common practical use of that approach. Or you can train it into the model's weights, usually with a method called LoRA. That is fine-tuning (FT). The usual view is that RAG and fine-tuning are two alternatives on a cost axis: RAG/ICL is more expensive to use (you carry the retrieved context along), FT is more expensive to make (you actually have to train), but either way you end up with "the same knowledge."

You do not end up with the same knowledge. You end up with knowledge held in two completely different ways, and the difference shows up sharply in the model's behaviour. This is the substrate-level answer to the RAG-vs-fine-tuning debate, where "substrate level" means how knowledge is actually represented in the model's physical structure: they are not two implementations of the same memory. They are two different memory regimes.

The mechanism in one paragraph

Every time you take a gradient step on a language model (every backward pass under the usual loss) you strengthen the route that produced the "right" answer and push the alternatives down. Do it once, and the effect is small. Do it thousands of times, and the alternatives get pushed below the noise floor. They are effectively gone from the distribution the model can call up. ICL does not do this. ICL just runs the prompt forward through the model, and the model produces an answer. The distribution over candidate answers is still there. It is just computed on the fly. FT compresses that distribution as a structural consequence of how training works, not because of anything specific in the training data.
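The compression effect can be seen in a toy setting. The sketch below is an illustration, not the paper's code: it applies cross-entropy gradient steps directly to four made-up logits (no real model, no weights), with a hypothetical learning rate of 0.5. The only fact it relies on is that the gradient of the loss -log p[target] with respect to the logits is p minus the one-hot target, so every step raises the target logit and lowers the others.

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def entropy(p):
    p = p[p > 0]                 # skip exact zeros to avoid log(0)
    return float(-(p * np.log(p)).sum())

def train_logits(logits, target, steps, lr=0.5):
    """Apply `steps` cross-entropy gradient steps directly to the logits."""
    z = np.array(logits, dtype=float)
    onehot = np.zeros_like(z)
    onehot[target] = 1.0
    for _ in range(steps):
        p = softmax(z)
        z -= lr * (p - onehot)   # dL/dz = p - onehot for L = -log p[target]
    return softmax(z)

start = [1.0, 0.8, 0.5, 0.0]     # toy logits for four candidate answers (made up)
for steps in (0, 1, 100, 10000):
    p = train_logits(start, target=0, steps=steps)
    # columns: number of steps, probabilities, entropy in nats
    print(steps, p.round(4), round(entropy(p), 4))
```

One step barely moves the distribution. Thousands of steps leave almost no probability mass on the alternatives, and the entropy falls with it, whatever the starting logits were.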

The parallel to cognitive science

That distinction looks exactly like the one Atkinson and Shiffrin drew in 1968, and that Alan Baddeley later refined: the difference between working memory and long-term memory. ICL is working memory. FT is long-term memory. They are not two implementations of the same thing. They are two different memory regimes, and language models have both, just as humans do.

To make the parallel concrete, here is how the two systems compare in humans, alongside how the same split shows up in language models:

Property	Working memory	Long-term memory	ICL	FT
Held by	Active brain activity	Synaptic structure	Forward computation	Weights
Alternatives	Stay available	Compressed away	Visible in logprobs	Pushed below the noise floor
Cost per use	High (attention & energy)	Low (cheap retrieval)	High (context tokens)	Low (no context)
Capacity	About 4–7 items	Practically unlimited	Context window	Practically unlimited
Does it decay?	Yes, fast without rehearsal	No, once consolidated	Yes, gone when the context ends	No, persists in the weights
How certain does it feel?	Honestly uncertain	"I just know it"	Actually tracks uncertainty	Confident regardless

"Felt certainty" in long-term memory is the key

When you hold a phone number in working memory, you know perfectly well that you might forget it. You feel the uncertainty. When you recall your own birthday, you feel no uncertainty at all; it simply is. That subjective difference is the surface signature of a real architectural difference. Working memory keeps the alternatives available, so it knows what it does not know. Long-term memory consolidated the answer at the cost of compressing everything else away, so the "I might be wrong" signal got compressed away along with the alternatives.

Most of the time that is fine, because long-term memory is usually right. But when it is wrong (false memories, fluent confabulation, the confident wrong answer in an exam) the error arrives with the same felt certainty as the correct memories. There is no warning signal. This is exactly what FT-trained language models do when they hallucinate. The substrate signal that would have flagged the answer as uncertain has been pushed below the noise floor.

Procedural memory is an even closer parallel

There is a third category of human memory that often gets lumped under "long-term": procedural memory. It is how you ride a bike, type on a keyboard, or drive a familiar route. Procedural memories are even more compressed than ordinary long-term memories: you cannot articulate how you do it. The alternatives that were once weighed during the learning phase are gone. You do not think about your left foot when you walk. The whole choice architecture has been compiled into something that runs without conscious access.

That is what an overtrained fine-tuned model looks like. It runs fluently. It commits to answers without visible deliberation. And if you ask it "how did you decide that?" it produces a post-hoc rationalisation, because the actual decision substrate no longer carries that information. The alternatives the answer was chosen against have been compressed away.

Why both systems exist (in brains and in language models)

You could not live with working memory alone. Every fact, every skill, every word of language would have to be held active, every moment, at a metabolic cost. You would run out of capacity in seconds. Long-term memory exists because it is cheaper to consolidate frequently-used knowledge into the structure than to recompute it every time. The compression cost, the loss of alternatives and the loss of calibrated uncertainty, is what you trade for the cheapness.

You could not live with long-term memory alone either. You could not reason about new situations, hold tentative hypotheses, or notice that you do not know something. Working memory is what keeps the system honest about uncertainty.

Brains have both. Language models have both: ICL when you give them new information through the prompt, FT when you train it in. The mistake is to treat ICL and FT as alternatives on a cost axis. They are not alternatives. They are complements, exactly as working memory and long-term memory are complements in human cognition. Cheap reliable retrieval lives in long-term memory / FT; calibrated reasoning under uncertainty lives in working memory / ICL. A well-designed system uses both.

That is what the paper proposes hybrid memory architectures should look like: fine-tune the cheap, settled knowledge into the weights, and use the context window for the calibrated, uncertainty-aware part of the reasoning. The two-system architecture that biological memory landed on through evolution has a structural reason behind it, and language models will land on the same architecture if they are deployed to the same kinds of tasks.

Here is the striking thing about all of it: nobody designed it. You could not have invented a smarter system for learning (two regimes, one calibrated and flexible, one cheap and consolidated, with an elegant trade-off between them) and yet nobody sat down and engineered it. Both systems exist because the same physics applies. Working memory and long-term memory are not features of biology. They are what falls out of the constraint that you have to choose between held-by-computation and consolidated-by-substrate-change. Brains arrived at the architecture under selection pressure. Transformers arrived at the same architecture under gradient descent. Neither of them knew where they were heading. Both ended up at the only architecture available.

What the experiments showed

I ran three experiments on an invented 47-fact knowledge base called "Zorbetik" (invented so the model could not already know it), tested on Qwen2.5 base models at 3B and 7B parameters, with LoRA fine-tuning budgets from 5 to 100 epochs:

ICL wins on cloze recall by 16–28 percentage points. The model that has just been told the fact answers better than the model that has been trained on the fact, often by a large margin.
FT actually makes application worse. Not just worse than ICL, but worse than the no-context baseline. Training the model on knowledge degraded its ability to use that knowledge to answer related questions.
The competing-routes signal collapses with training. The paper tracks a first-token commitment measure, log(CR_pos0), where a higher value means the model has committed harder to a single answer and is weighing fewer alternatives. With ICL it is about 5. With raw FT after 30 epochs it is about 18. With paraphrase-augmented FT (which gets even more gradient steps) it is about 21. The model commits earlier and earlier, with fewer and fewer alternatives in mind.
Entropy goes toward zero. Under any FT regime, the entropy at position 0 (a measure of how uncertain the model is about its first answer word) collapses to practically zero. The model is locked in, no matter what was in the training data.

ICL preserves the substrate's hesitation signal at log(CR_pos0) ≈ 5.46. Raw fine-tuning at 30 epochs pushes it to 17.85; paraphrase-augmented FT (~10× more gradient steps) to 21.12. The model commits earlier and earlier; the alternatives get pushed below the noise floor. Position-0 entropy collapses from 0.32 (ICL) to ≈0.00 (any FT regime).
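For readers who want to compute a position-0 entropy themselves, here is a minimal sketch. It assumes you already have the raw logits over the vocabulary at the first answer position, from whatever inference stack you use. The function names are hypothetical, the result is in nats, and the example logits are made up rather than taken from the paper.

```python
import numpy as np

def log_softmax(logits):
    z = np.asarray(logits, dtype=float)
    z = z - z.max()                      # stabilise before exponentiating
    return z - np.log(np.exp(z).sum())

def position0_entropy(first_token_logits):
    """Shannon entropy (nats) of the next-token distribution at position 0."""
    logp = log_softmax(first_token_logits)
    p = np.exp(logp)
    return float(-(p * logp).sum())

hesitant = [2.0, 1.5, 1.0, -3.0]         # made-up logits: several live candidates
locked_in = [30.0, 0.0, 0.0, 0.0]        # made-up logits: one candidate dominates
print(position0_entropy(hesitant), position0_entropy(locked_in))
```

A distribution with several live candidates gives a clearly positive entropy. One where a single token dominates gives a value indistinguishable from zero, which is the pattern reported above for the FT conditions.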

Why this matters

Confident hallucination

FT-trained models hallucinate confidently because alternative answers have been compressed out of reach. The "I am not sure" signal that would normally flag a wrong answer is gone, not because the model is sure, but because the substrate signal that carries uncertainty has been pushed below the noise floor by cumulative training. The model is not lying about its confidence; the part of its computation that would have given it second thoughts has been silenced.

Agentic systems cannot represent uncertainty when they are FT-only

A system built on a fine-tuned model cannot reliably tell its operator "I don't know" or "I'm only 60% sure." The substrate signal that would have supplied that information has been compressed. This is structural, not a fault in any specific training run.
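To see why this is structural, consider the simplest possible confidence gate an agent could use. The sketch below is hypothetical (the function name, the 0.6 threshold, and the use of top-token probability as a confidence proxy are all assumptions, not something the paper specifies). It reports uncertainty only if the first-token distribution still carries some.

```python
def report_confidence(probs, abstain_below=0.6):
    """Turn a first-token probability distribution into an operator-facing message.

    Assumption: the top probability is used as a crude stand-in for confidence.
    """
    top = max(probs)
    if top < abstain_below:
        return "I don't know"
    return f"I'm about {top:.0%} sure"

# Made-up distributions, for illustration only.
print(report_confidence([0.4, 0.3, 0.3]))           # spread out: the gate can fire
print(report_confidence([0.6, 0.4]))                # borderline: a hedged answer
print(report_confidence([0.999, 0.0005, 0.0005]))   # collapsed: always "sure"
```

If the distribution has collapsed, as in the last call, no threshold setting makes the gate fire on wrong answers, because right and wrong answers arrive with the same near-certain probability.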

RAG vs FT at the substrate level

The endless RAG-vs-fine-tuning debate has a substrate-level resolution: RAG operates in ICL mode. It pulls retrieved documents into the prompt, and the model evaluates them forward. Calibration is preserved. FT compresses calibration as a structural consequence. They are not two implementations of the same memory; they are two different memory regimes.

Long-context agents inherit ICL's calibration for free

Claude Code, Cursor, and any multi-turn agentic conversation operate in ICL mode by default: each turn re-evaluates the full context with no weight update. They inherit working memory's calibration properties automatically. That is why long-context agentic conversations can feel "more honest" than a fine-tuned chatbot on the same knowledge: the substrate does not lie about its uncertainty because the substrate signal is still intact.

Hybrid memory architectures

Concrete ICL+FT compositions can approximate the two-system structure of biological memory: long-term-consolidated knowledge in the weights for cheap retrieval; working-memory-mode calibrated reasoning in the context for uncertainty-aware application. The price is a bounded context-window cost, paid only where the calibration property matters.
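One way such a composition could be wired up is sketched below. Everything in it is hypothetical: the Fact record, the `settled` flag (standing for "already fine-tuned into the weights"), the naive keyword match, and the placeholder facts. It only illustrates the routing idea: settled knowledge costs no context tokens, and everything else is placed in the prompt.

```python
from dataclasses import dataclass

@dataclass
class Fact:
    key: str        # keyword used for the naive lookup below
    text: str       # the fact as it would appear in the prompt
    settled: bool   # hypothetical flag: True if already trained into the weights

def build_prompt(question, facts):
    """Put only unsettled, relevant facts into the context window."""
    q = question.lower()
    context = [f.text for f in facts if not f.settled and f.key.lower() in q]
    if not context:
        return question                      # rely on the weights alone
    lines = "\n".join(f"- {t}" for t in context)
    return f"Context:\n{lines}\n\nQuestion: {question}"

facts = [
    Fact("alpha", "Placeholder fact about alpha.", settled=True),
    Fact("beta", "Placeholder fact about beta.", settled=False),
]
print(build_prompt("What do we know about beta?", facts))
print(build_prompt("What do we know about alpha?", facts))
```

A real system would replace the keyword match with a retriever and would need its own criterion for what counts as settled. The sketch takes no position on either.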

What this does not claim

FT is only tested as LoRA (full-parameter fine-tuning is not tested). The experiments use a single random seed, two model sizes (3B and 7B), and one invented knowledge domain. The paraphrase-augmented condition in Experiment 3 has ≈10× more gradient steps than the raw FT condition, so a clean compute-matched comparison is left to follow-up. The qualitative direction (ICL beats LoRA-FT on calibration; the gap scales with gradient steps) holds within these scope conditions.
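What a compute-matched follow-up would involve can be worked out with simple arithmetic. The sketch below rests on assumptions that the text above does not state: one training example per fact in the raw condition, ten paraphrases per fact in the augmented condition (chosen only to reproduce the ≈10× figure), a batch size of 1, and no gradient accumulation. The 47 facts and 30 epochs come from the experiments described earlier.

```python
import math

def gradient_steps(n_examples, epochs, batch_size=1):
    """Optimizer steps for a run, assuming no gradient accumulation."""
    return math.ceil(n_examples / batch_size) * epochs

def matched_epochs(target_steps, n_examples, batch_size=1):
    """Epochs needed on a larger dataset to hit roughly the same step count."""
    steps_per_epoch = math.ceil(n_examples / batch_size)
    return max(1, round(target_steps / steps_per_epoch))

raw_steps = gradient_steps(n_examples=47, epochs=30)        # raw FT condition
aug_steps = gradient_steps(n_examples=47 * 10, epochs=30)   # assumed 10 paraphrases per fact
print(raw_steps, aug_steps, aug_steps / raw_steps)
print(matched_epochs(raw_steps, n_examples=47 * 10))        # epochs for a step-matched run
```

Under these assumptions, a step-matched paraphrase condition would train for about a tenth as many epochs as the raw condition. The actual numbers depend on the real batch size and paraphrase count.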