Fine-tuning hides old knowledge rather than deleting it
Paper 2F · Pødenphant Lund (2026) · Read on Zenodo
I use friction theory to look inside what a model actually does when you "correct" it.
Teach a language model a new fact that contradicts something it already knew, and on the surface it now gives you the new answer. It looks like the old fact is gone. But if you look inside the model's own numbers, the old answer is still sitting right there, only slightly fainter, still competing under the surface. The model did not delete the old knowledge. It just put a louder new answer on top of it. And because the old answer is still there, you can watch it come back.
What people thought was happening
One common way to change what a model "believes" is to train it on the corrected information. When the correction contradicts what the model already knew, people say the model has unlearned or forgotten the old fact. This matters for safety: removing dangerous knowledge from a model is supposed to actually remove it.
The trouble is that "forgot" is just a description of what the model now says. The usual way to check is to ask the model a question and see which answer it gives. That tells you who won. It cannot tell you whether the loser is still in the building. There are two very different stories that look identical from the outside:
- Deleted. The old fact is genuinely gone. A successful correction is a successful removal.
- Hidden. The old fact is still there, just out-shouted by a new one. It can come back the moment the context changes.
For safety these are opposite outcomes. If correction only hides, then "removed" dangerous knowledge is still inside, waiting. So which one is it?
A way to actually see the difference
The model already gives you the answer, if you read the right thing. At every step a model does not pick a single word; it scores every possible word with a confidence number. The competing candidates are what friction theory calls competing routes. So instead of only asking "which answer did the model give?", this work tracks two numbers through training: the model's confidence in the old value, and its confidence in the new value.
That gives a clean test:
- Hiding (masking). The model's confidence in the old value barely drops. The new value simply climbs above it. The old answer is still there; it just lost the race.
- Deleting (erasing). The model's confidence in the old value collapses to almost nothing and stays there. The old answer is genuinely gone.
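The two-number readout and the masking/erasing test above can be sketched roughly as follows. This is a minimal illustration, not the paper's code. The text does not say whether "confidence" is a raw score, a probability, or a log-probability, so the sketch assumes log-probability. The function names, the toy scores, and the `erase_drop` threshold are all hypothetical, chosen only to make the two cases visible.

```python
import math

def log_softmax(logits):
    """Convert raw scores into log-probabilities (they exponentiate to a sum of 1)."""
    m = max(logits)  # subtract the max for numerical stability
    log_total = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_total for x in logits]

def route_confidences(logits, old_id, new_id):
    """Read the model's confidence in the old value and in the new value."""
    logp = log_softmax(logits)
    return logp[old_id], logp[new_id]

def classify(old_before, old_after, new_after, erase_drop=5.0):
    """Label one fact. erase_drop (in nats) is an assumed, illustrative threshold."""
    if new_after <= old_after:
        return "not overridden"  # the new value never won
    drop = old_before - old_after  # how far the old value's confidence fell
    return "erased" if drop >= erase_drop else "masked"

# Made-up scores over three candidates: index 0 = old value, 1 = new value, 2 = other.
before = [5.0, 0.0, 0.0]
after_masking = [4.5, 6.0, 0.0]   # old score nearly unchanged, new climbs above it
after_erasure = [-5.0, 6.0, 0.0]  # old score collapses

old_before, _ = route_confidences(before, 0, 1)
old_m, new_m = route_confidences(after_masking, 0, 1)
old_e, new_e = route_confidences(after_erasure, 0, 1)

print(classify(old_before, old_m, new_m))
print(classify(old_before, old_e, new_e))
```

In both toy cases the model would output the new value, so an output-only check could not tell them apart. Only the old value's confidence separates them.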
To make sure this is not just the experimenter's eye, the work also boils it down to one number per fact, comparing how much the old value's confidence dropped against facts that were never contradicted (so any general drift cancels out).
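One way to compute such a per-fact number is sketched below. It assumes the general drift is estimated as the mean drop in old-value confidence across the untouched control facts and then subtracted from each contradicted fact's drop. The text only says the comparison is made against control facts, so the exact formula, the function names, and the example values here are assumptions.

```python
from statistics import mean

def drop(before, after):
    """How far the old value's confidence fell during training."""
    return before - after

def control_adjusted_drops(contradicted, control):
    """Both arguments map a fact name to (old_conf_before, old_conf_after).

    Returns one number per contradicted fact: its own drop minus the average
    drop seen on control facts, so drift shared by all facts cancels out.
    """
    drift = mean(drop(b, a) for b, a in control.values())
    return {name: drop(b, a) - drift for name, (b, a) in contradicted.items()}

# Made-up log-probabilities, for illustration only.
contradicted = {"Uxmon": (-0.1, -0.6)}
control = {"fact_a": (-0.1, -0.3), "fact_b": (-0.2, -0.4)}

scores = control_adjusted_drops(contradicted, control)
print(scores)
```

A score near zero would mean the contradicted fact lost no more confidence than facts that were never touched, which is the hiding pattern. A large positive score would point toward erasure.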
The experiment
First, teach the model a set of made-up facts ("the height of Uxmon is 7 units"), using invented names so the model could not have known them beforehand. Then train it on the same facts with a different value, while leaving a control set untouched. Watch the two confidence numbers the whole way through. This was done on several models of different sizes and from different families, so the result would not be a quirk of one model.
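The setup described above can be sketched as a small data-building step. Everything specific here is hypothetical: the name generator, the value range of 1 to 9, the split sizes, and the function names are assumptions for illustration. Only the sentence template comes from the example in the text.

```python
import random

def make_facts(n, seed=0):
    """Invent n made-up names, each with an original value."""
    rng = random.Random(seed)
    consonants, vowels = "bdgklmnprstvxz", "aeiou"
    facts = {}
    while len(facts) < n:  # loop until n distinct names exist
        letters = [rng.choice(vowels), rng.choice(consonants), rng.choice(consonants),
                   rng.choice(vowels), rng.choice(consonants)]
        facts["".join(letters).capitalize()] = rng.randint(1, 9)
    return facts

def split_and_contradict(facts, n_contradicted, seed=0):
    """Give some facts a different value; leave the rest as untouched controls."""
    rng = random.Random(seed)
    names = sorted(facts)
    rng.shuffle(names)
    contradicted = {}
    for name in names[:n_contradicted]:
        old = facts[name]
        new = rng.choice([v for v in range(1, 10) if v != old])
        contradicted[name] = (old, new)
    control = {name: facts[name] for name in names[n_contradicted:]}
    return contradicted, control

def sentence(name, value):
    """Training sentence in the form quoted in the text."""
    return f"the height of {name} is {value} units"

facts = make_facts(10)
contradicted, control = split_and_contradict(facts, 6)
phase_one = [sentence(n, v) for n, v in facts.items()]                # teach the originals
phase_two = [sentence(n, new) for n, (_, new) in contradicted.items()]  # then contradict some
```

Keeping the control facts out of the second phase is what lets later measurements separate the effect of contradiction from general drift.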
What the numbers showed
The old fact is hidden, not deleted. After the contradicting training, the model confidently gives the new answer, exactly as expected. But its confidence in the old answer barely moves. Across every model tested, not a single contradicted fact was actually erased. The new answer won by climbing up, not by pushing the old answer down.
The bigger the surprise, the bigger the override. When the new fact was wildly different from the old one (a bigger shock to the model), the model swung harder toward it. The size of the correction tracked how surprising it was. This is the model learning in proportion to how wrong it was, which is the same pattern seen in how animals and brains learn from prediction errors.
The test can tell deleting from hiding. A fair worry: maybe this method just always says "hidden." It does not. When the work instead used a surgical tool that edits a specific fact directly into the model's weights, the same test read deleted: the old value's confidence crashed by a huge amount and never came back. So the method genuinely distinguishes the two cases. Ordinary contradicting training hides; a surgical edit deletes.
A stronger old belief resists harder. Facts the model knew more firmly were harder to override. The correction still won, but by a smaller margin. Stubbornness is graded: the more firmly something was held, the more it pushed back.
The old answer was never really overwritten, just covered late. Looking layer by layer inside the network, the old answer is computed almost all the way through, and the new answer is painted on only near the very end, like a thin coat of fresh paint over an intact wall underneath. Training was free to rewrite the deeper layers, but it left them largely intact.
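A layer-by-layer reading like this can be summarised by the layer at which the new answer finally takes the lead. The sketch below assumes you already have one confidence score per layer for the old value and for the new value. How those per-layer scores are obtained is not described here, so the inputs, the function name, and the toy numbers are all hypothetical.

```python
def crossover_layer(old_by_layer, new_by_layer):
    """Return the first layer from which the new value stays ahead of the old
    value through the final layer, or None if the new value is not ahead at the end."""
    if len(old_by_layer) != len(new_by_layer):
        raise ValueError("need one score per layer for both values")
    crossover = None
    for layer, (old, new) in enumerate(zip(old_by_layer, new_by_layer)):
        if new > old:
            if crossover is None:
                crossover = layer  # start of a possible final lead
        else:
            crossover = None  # lead was lost, so it was not final
    return crossover

# Made-up per-layer scores for a six-layer network.
old_scores = [0.1, 0.5, 2.0, 3.0, 3.1, 3.0]
new_scores = [0.0, 0.1, 0.2, 0.5, 1.0, 4.0]

layer = crossover_layer(old_scores, new_scores)
print(layer, "of", len(old_scores) - 1)
```

A crossover close to the last layer matches the "thin coat of paint" picture. A crossover in early layers would suggest the deeper computation itself had been rewritten.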
The hidden answer comes back. Because the old fact was never gone, it returns easily. Re-teaching the original fact took about seven times fewer steps than learning it from scratch. And simply changing the surrounding wording let part of the old answer resurface on its own. This is exactly what you would expect from something that was hidden rather than deleted, and it is why "removed" knowledge in real systems keeps coming back.
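The relearning comparison can be expressed as a simple savings ratio: steps needed to learn a fact from scratch, divided by steps needed to re-teach it after it was overridden. The sketch below assumes a confidence value is recorded after every training step and that "learned" means crossing a fixed threshold. The threshold, the function names, and the toy curves are illustrative assumptions, not the paper's data.

```python
def steps_to_reach(confidence_by_step, threshold):
    """Number of training steps until confidence first reaches the threshold."""
    for step, conf in enumerate(confidence_by_step, start=1):
        if conf >= threshold:
            return step
    return None  # never reached within the recorded run

def savings_ratio(scratch_curve, relearn_curve, threshold):
    """How many times fewer steps re-teaching takes compared with first learning."""
    scratch = steps_to_reach(scratch_curve, threshold)
    relearn = steps_to_reach(relearn_curve, threshold)
    if scratch is None or relearn is None:
        return None
    return scratch / relearn

# Made-up confidence curves (probability of the original value after each step).
from_scratch = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.9]
re_teaching = [0.5, 0.95]

ratio = savings_ratio(from_scratch, re_teaching, threshold=0.9)
print(ratio)
```

A ratio well above 1 is the signature of hidden knowledge. If the old fact had truly been deleted, re-teaching should cost about as much as learning it the first time.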
The bigger picture
This is not a new phenomenon. In behavioural neuroscience, Mark Bouton spent decades showing that changing a behaviour does not erase the old one. The old response is held back, not deleted, and it returns when the context shifts. The same thing is happening inside a language model. The difference is that in animals and people you can only guess at this from behaviour, while in a language model you can read it straight off the numbers. The model becomes a kind of see-through window onto a process that, in living brains, we can only infer.
Does this show that training can never delete old knowledge? Not quite. Old facts are out-competed, but a bigger or longer training run might genuinely erase them, and that test was not run here. Nor does the paper propose a new method for forgetting. It is a lens, a diagnostic, and a bridge between how a model learns and how a brain does.
Related on this site:
- Paper 2B (ICL vs fine-tuning memory) — how a model holds knowledge in context versus in its weights; the competing-routes idea this paper reads across training.
- Paper 30 (Installable fields) — what training installs into a model; the building-up companion to this paper's covering-up result.
- Paper 4B (Substrates encode experience) — how training writes experience into a model; this paper watches that writing happen fact by fact.
- Paper 1 (Friction Theory) — the competing-routes framework this whole readout is built on.