Fine-tuning hides old knowledge rather than deleting it

Paper 2F · Pødenphant Lund (2026)

I use friction theory to look inside what a model actually does when you "correct" it. Teach a language model a new fact that contradicts something it already knew, and on the surface it now gives you the new answer. It looks like the old fact is gone. But if you look inside the model's own numbers, the old answer is still sitting right there, only slightly fainter, still competing under the surface. The model did not delete the old knowledge. It just put a louder new answer on top of it. And because the old answer is still there, you can watch it come back.

What people thought was happening

One common way to change what a model "believes" is to train it on the corrected information. When the correction contradicts what the model already knew, people say the model has unlearned or forgotten the old fact. This matters for safety: removing dangerous knowledge from a model is supposed to actually remove it.

The trouble is that "forgot" is just a description of what the model now says. The usual way to check is to ask the model a question and see which answer it gives. That tells you who won. It cannot tell you whether the loser is still in the building. There are two very different stories that look identical from the outside: either the old fact has truly been deleted, or it is still inside the model, merely hidden beneath a louder new answer.

For safety these are opposite outcomes. If correction only hides, then "removed" dangerous knowledge is still inside, waiting. So which one is it?

A way to actually see the difference

The model already gives you the answer, if you read the right thing. At every step a model does not pick a single word; it scores every possible word with a confidence number. The competing candidates are what friction theory calls competing routes. So instead of only asking "which answer did the model give?", this work tracks two numbers through training: the model's confidence in the old value, and its confidence in the new value.
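As a rough illustration of reading both routes rather than only the winner, here is a minimal sketch. The scores are invented, and `read_routes` is a hypothetical helper, not code from the paper. It also assumes the "confidence number" is the raw score (logit) for each candidate; this summary does not say which quantity the paper actually tracks.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(s - peak) for tok, s in scores.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def read_routes(scores, old_value, new_value):
    """Report the raw score of each competing route and which token wins."""
    return {
        "old": scores[old_value],
        "new": scores[new_value],
        "winner": max(scores, key=scores.get),
    }

# Invented scores for the token after "the height of Uxmon is"
before = {"7": 4.0, "3": 0.5, "9": 0.2}
after = {"7": 3.8, "3": 6.0, "9": 0.1}

for label, scores in (("before", before), ("after", after)):
    routes = read_routes(scores, old_value="7", new_value="3")
    p_old = softmax(scores)["7"]
    print(label, routes, f"p(old)={p_old:.2f}")
```

In this toy case the winner flips from "7" to "3" while the raw score for "7" barely changes. The probability of "7" still falls sharply, because probabilities must sum to one: a rising rival lowers it mechanically. That is why the sketch reads raw scores when asking whether the old route is still there.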

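That gives a clean test: if the old value's confidence collapses while the new one rises, the fact was deleted; if the old value's confidence stays roughly where it was and the new one simply climbs above it, the fact was only hidden.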

To make sure this is not just the experimenter's eye, the work also boils it down to one number per fact, comparing how much the old value's confidence dropped against facts that were never contradicted (so any general drift cancels out).
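One way such a per-fact number could be computed is sketched below. The exact formula used in the paper is not given in this summary, so `erasure_score`, the `threshold` value, and all of the numbers are assumptions made for illustration.

```python
from statistics import mean

def erasure_score(old_before, old_after, control_before, control_after):
    """Drop in the old value's confidence, minus the average drop on control facts.

    Subtracting the control drop removes drift that affects every fact,
    contradicted or not.
    """
    drift = mean([b - a for b, a in zip(control_before, control_after)])
    return (old_before - old_after) - drift

def verdict(score, threshold):
    """Label a fact 'deleted' if the corrected drop is large, else 'hidden'.

    The threshold is a hypothetical cut-off, not a value from the paper.
    """
    return "deleted" if score >= threshold else "hidden"

# Invented confidences for two untouched control facts
control_before = [4.1, 3.9]
control_after = [4.0, 3.8]

# A fact whose old value barely moved, and one whose old value crashed
barely_moved = erasure_score(4.0, 3.8, control_before, control_after)
crashed = erasure_score(4.0, -1.0, control_before, control_after)

print(verdict(barely_moved, threshold=2.0))
print(verdict(crashed, threshold=2.0))
```

With these made-up inputs the first fact reads as hidden and the second as deleted, which mirrors the two outcomes the test is meant to separate.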

The experiment

First, teach the model a set of made-up facts ("the height of Uxmon is 7 units"), using invented names so the model could not have known them beforehand. Then train it on the same facts with a different value, while leaving a control set untouched. Watch the two confidence numbers the whole way through. This was done on several models of different sizes and from different families, so the result would not be a quirk of one model.
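The setup can be sketched as a small data-building step. This is only a guess at the shape of the data: the name generator, the value range 1 to 9, and the function names are all hypothetical, and only the sentence template comes from the example above.

```python
import random

def make_name(rng, length=5):
    """Build an invented name by alternating vowels and consonants."""
    consonants, vowels = "bdfgklmnprstvxz", "aeiou"
    letters = [rng.choice(consonants if i % 2 else vowels) for i in range(length)]
    return "".join(letters).capitalize()

def build_fact_sets(n_facts, n_control, seed=0):
    """Create facts with an old value; non-control facts get a different new value."""
    rng = random.Random(seed)
    names = set()
    while len(names) < n_facts:
        names.add(make_name(rng))
    facts = []
    for i, name in enumerate(sorted(names)):  # sorted for a reproducible order
        old = rng.randint(1, 9)
        is_control = i < n_control
        # Control facts keep their value; contradicted facts get any other value.
        new = old if is_control else rng.choice([v for v in range(1, 10) if v != old])
        facts.append({"name": name, "old": old, "new": new, "control": is_control})
    return facts

def sentence(name, value):
    """Render one training sentence in the style of the example above."""
    return f"the height of {name} is {value} units"

facts = build_fact_sets(n_facts=6, n_control=2)
for fact in facts:
    print(sentence(fact["name"], fact["old"]), "->", sentence(fact["name"], fact["new"]))
```

Phase one would train on the sentences built from `old`, phase two on those built from `new`, with the control facts left out of phase two.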

What the numbers showed

The old fact is hidden, not deleted. After the contradicting training, the model confidently gives the new answer, exactly as expected. But its confidence in the old answer barely moves. Across every model tested, not a single contradicted fact was actually erased. The new answer won by climbing up, not by pushing the old answer down.

The bigger the surprise, the bigger the override. When the new fact was wildly different from the old one (a bigger shock to the model), the model swung harder toward it. The size of the correction tracked how surprising it was. This is the model learning in proportion to how wrong it was, which is the same pattern seen in how animals and brains learn from prediction errors.

The test can tell deleting from hiding. A fair worry: maybe this method just always says "hidden." It does not. When the researchers used a surgical tool that edits a specific fact directly into the model's weights, the same test read deleted: the old value's confidence crashed by a huge amount and never came back. So the method genuinely distinguishes the two cases. Ordinary contradicting training hides; a surgical edit deletes.

A stronger old belief resists harder. Facts the model knew more firmly were harder to override. The correction still won, but by a smaller margin. Stubbornness is graded: the more firmly something was held, the more it pushed back.

The old answer was never really overwritten, just covered late. Looking layer by layer inside the network, the old answer is computed almost all the way through, and the new answer is painted on only near the very end, like a thin coat of fresh paint over an intact wall underneath. Training could have rewritten the deeper layers, but it did not.
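To make "covered late" concrete, the sketch below finds the layer at which the new answer takes over for good. It assumes you already have a score for each value at every layer; how the paper obtains those per-layer readings is not described in this summary. The numbers and the `takeover_layer` helper are invented for illustration.

```python
def takeover_layer(old_by_layer, new_by_layer):
    """Index of the first layer from which the new value stays ahead, or None."""
    takeover = None
    for layer, (old, new) in enumerate(zip(old_by_layer, new_by_layer)):
        if new > old:
            if takeover is None:
                takeover = layer  # start of a possible final lead
        else:
            takeover = None  # the old value is ahead again, so reset
    return takeover

# Invented per-layer scores for a six-layer toy model
old_by_layer = [0.1, 0.8, 1.9, 3.0, 3.6, 3.8]
new_by_layer = [0.0, 0.2, 0.4, 0.9, 2.5, 6.0]

layer = takeover_layer(old_by_layer, new_by_layer)
print(f"new value takes over at layer {layer} of {len(old_by_layer) - 1}")
```

A takeover at or near the last index corresponds to the thin-coat-of-paint picture. A takeover in the early layers would suggest the fact had been rewritten deeper in the network.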

The hidden answer comes back. Because the old fact was never gone, it returns easily. Re-teaching the original fact took roughly one-seventh as many steps as learning it from scratch. And simply changing the surrounding wording let part of the old answer resurface on its own. This is exactly what you would expect from something that was hidden rather than deleted, and it is why "removed" knowledge in real systems keeps coming back.
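The relearning comparison is simple arithmetic: count the training steps needed to reach some accuracy criterion in each condition and take the ratio. The curves, the criterion of 90, and the function names below are invented for illustration and are not the paper's data.

```python
def steps_to_criterion(accuracy_by_step, criterion):
    """1-based step at which accuracy first reaches the criterion, or None."""
    for step, accuracy in enumerate(accuracy_by_step, start=1):
        if accuracy >= criterion:
            return step
    return None

def savings_ratio(scratch_curve, relearn_curve, criterion):
    """How many times more steps learning from scratch took than relearning."""
    scratch = steps_to_criterion(scratch_curve, criterion)
    relearn = steps_to_criterion(relearn_curve, criterion)
    if scratch is None or relearn is None:
        return None  # one of the runs never reached the criterion
    return scratch / relearn

# Invented accuracy (in percent) after each training step
scratch_curve = [5, 10, 20, 35, 50, 65, 80, 92]
relearn_curve = [60, 93]

print(savings_ratio(scratch_curve, relearn_curve, criterion=90))  # prints 4.0
```

A ratio well above 1 is the signature of savings: the relearned fact had a head start, as it would if the old route were still intact.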

The bigger picture

This is not a new phenomenon. In behavioural neuroscience, Mark Bouton spent decades showing that changing a behaviour does not erase the old one. The old response is held back, not deleted, and it returns when the context shifts. The same thing is happening inside a language model. The difference is that in animals and people you can only guess at this from behaviour, while in a language model you can read it straight off the numbers. The model becomes a kind of see-through window onto a process that, in living brains, we can only infer.

Does this show that training can never delete old knowledge? Not quite. Old facts are out-competed, but a bigger or longer training run might genuinely erase them, and that test was not run here. Nor does the paper propose a new method for forgetting. It is a lens, a diagnostic, and a bridge between how a model learns and how a brain does.

The cite

Pødenphant Lund, T. (2026). Compete, Don't Erase: contradictory fine-tuning masks rather than deletes knowledge. Zenodo. https://doi.org/10.5281/zenodo.20570433
