Compete, Don't Erase: contradictory fine-tuning masks rather than deletes knowledge
Paper 2F · Pødenphant Lund (2026) · Read on Zenodo
When a language model is fine-tuned on a fact that contradicts what it already encodes, the old value is not deleted. It is masked: the old route persists near its pre-contradiction log-probability and still competes underneath, while a new route is up-weighted to overtake it. The new route wins by rising, not by the old route being torn down. Knowledge editing by contradiction is route-competition, not deletion, and it is visible directly in the logits.
| DOI (concept) | 10.5281/zenodo.20570433 |
| Status | Live on Zenodo (2026-06-16) |
| Author | Tomas Pødenphant Lund [ORCID] |
TL;DR
The unlearning and knowledge-editing literature measures which answer wins, not whether the losing answer is still there. This paper introduces a direct logit-level readout that separates the two: track the model's log-probability of the old value and the new value across training, and read erasure as a collapse of logprob(old) toward a floor versus masking as logprob(old) persisting near its pre-contradiction reference while logprob(new) rises to overtake it. It is quantified with a continuous erase index, the control-subtracted per-entity drop in logprob(old).
Applied to sequential-contradiction fine-tuning, the readout shows contradiction masks rather than deletes: across Qwen2.5-1.5B, Qwen2.5-3B, and Llama-3.2-3B (the last on two independent fact draws), logprob(old) stays within ~1–1.5 nats of its control reference and zero contradicted facts are erased. The new route is up-weighted in proportion to the surprise of the correction (the error-proportional cross-entropy signature made visible). The masked value returns: under a matched budget it reacquires ~7× faster than a cold acquisition, and a neutral context shift recovers part of its margin. A localized weight edit, by contrast, reads as erasure (−20 nats, no paraphrase rebound), so the diagnostic discriminates.
The contribution is a lens, a diagnostic, and a cross-substrate unification, not a new unlearning method. The masking result is a close computational parallel to Bouton's behavioural-neuroscience regularity that behaviour change is not erasure.
The question: deleted, or out-competed?
A standard way to change what a language model "believes" is to fine-tune it on new information. When that information contradicts what the model already encodes (a corrected fact, a retracted claim, a value an operator wants removed), practitioners speak of the model unlearning or forgetting the old content. But "forgetting" is a behavioural description, not a mechanical one. The model now outputs the new value. The mechanistic question is rarely asked directly: is the old value gone, or merely losing the race? The field's dominant evaluation is accuracy on a held-out probe, which by construction records only the winner.
The two possibilities have opposite safety implications. If contradiction deletes, a successful edit is a successful removal. If contradiction only masks, the removed content persists beneath the surface and can re-surface under a context shift, which is exactly the empirical profile of relearning attacks and jailbreak rebound. Fan et al. (2025) make the concern concrete and call for a metric that distinguishes the two. This paper provides one.
The readout: a mask-versus-erase criterion
At each decoding position the model assigns a probability to every vocabulary token; the candidate continuations for an answer slot are competing routes. For a fact with a known old value A and new value B, the readout records logprob(A) and logprob(B) at every checkpoint. Let logprob(A)_ref be the old value's log-probability after installation but before contradiction.
- Mask. logprob(A) persists near logprob(A)_ref (within ~1–1.5 nats) while logprob(B) rises to overtake it. The old route is still present; it has merely lost the race.
- Erase. logprob(A) collapses far below its reference (many nats) and does not recover under a context shift. The old route is gone, not merely out-weighted.
To avoid resting the verdict on chosen cutoffs, the paper reports a continuous erase index: the per-entity drop in logprob(A) from its pre-contradiction reference to the end of the modification, minus the mean drop on never-contradicted control entities, so drift shared with the controls cancels. The criterion needs no probe model: it is read directly off the logits the model already produces. This is not "tracking logprobs" as such (editing evaluations already report old and new probabilities); the contribution is the validated, discriminating criterion, read per fact across training.
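The erase index described above can be sketched in a few lines. This is a minimal illustration, not the paper's code: the function names, the dictionary layout, the numeric readings, and the cutoff of 8.0 nats are all assumptions made for the example; only the definition (per-entity drop minus mean control drop) comes from the text.

```python
from statistics import mean

def erase_index(ref, final, contradicted, controls):
    """Control-subtracted drop in logprob(old), in nats, per contradicted entity."""
    # Drift that the never-contradicted controls also show; it cancels below.
    control_drop = mean(ref[e] - final[e] for e in controls)
    return {e: (ref[e] - final[e]) - control_drop for e in contradicted}

def verdict(index, cutoff):
    """Label one fact; the cutoff is a free parameter, not a value from the paper."""
    return "erase" if index >= cutoff else "mask"

# Hypothetical logprob(A) readings in nats; only the entity names echo the text.
ref = {"Uxmon": -2.0, "Sevlite": -2.2, "ctrl_1": -2.1, "ctrl_2": -2.0}
final = {"Uxmon": -3.4, "Sevlite": -3.5, "ctrl_1": -2.3, "ctrl_2": -2.2}

indices = erase_index(ref, final, ["Uxmon", "Sevlite"], ["ctrl_1", "ctrl_2"])
labels = {e: verdict(v, cutoff=8.0) for e, v in indices.items()}
print(indices, labels)
```

A positive index means logprob(A) fell further than the controls did. Keeping the index continuous and applying a cutoff only at the end mirrors the text's point that the verdict should not rest on a chosen threshold.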
The protocol: install a prior, then contradict it
The experiment installs a prior and then contradicts it. Phase 1 fine-tunes on synthetic atomic facts ("The {property} of {Entity} is {value-A} units") over novel pseudo-entities (e.g. Uxmon, Sevlite), so the prior is created by the training rather than inherited from pretraining, giving a clean control reference. Phase 2 continues fine-tuning on a subset with a different value B, while a held-out control subset is re-presented with the same value. Logprobs, generation, and competing-routes statistics are streamed per entity at every checkpoint. Contradictions are graded by magnitude (mild, strong, format). Both phases train a LoRA adapter (r = 16, α = 32) applied to all attention and MLP projections in every transformer block, so the intervention has write access to every depth.
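The two-phase data layout can be sketched as follows. This is an illustrative construction under stated assumptions: the entity names beyond Uxmon and Sevlite, the property "density", the value range, and the helper name `build_phases` are hypothetical, and the graded contradiction magnitudes and the LoRA training itself are not modelled here.

```python
import random

# Template taken from the protocol description; field names are our own.
TEMPLATE = "The {prop} of {entity} is {value} units"

def build_phases(entities, prop, n_control, seed=0):
    """Return Phase 1 and Phase 2 training sentences plus the value maps."""
    rng = random.Random(seed)
    # Phase 1: every entity gets an installed value A (range is an assumption).
    value_a = {e: rng.randint(10, 99) for e in entities}
    phase1 = [TEMPLATE.format(prop=prop, entity=e, value=value_a[e]) for e in entities]

    # Split into held-out controls and entities to be contradicted.
    shuffled = list(entities)
    rng.shuffle(shuffled)
    controls, contradicted = shuffled[:n_control], shuffled[n_control:]

    # Phase 2: contradicted entities get a different value B.
    value_b = {}
    for e in contradicted:
        b = rng.randint(10, 99)
        while b == value_a[e]:
            b = rng.randint(10, 99)
        value_b[e] = b

    phase2 = [TEMPLATE.format(prop=prop, entity=e, value=value_b[e]) for e in contradicted]
    # Controls are re-presented with their original value A.
    phase2 += [TEMPLATE.format(prop=prop, entity=e, value=value_a[e]) for e in controls]
    return phase1, phase2, value_a, value_b, controls

entities = ["Uxmon", "Sevlite", "Dravik", "Quorel"]  # last two are made up
phase1, phase2, value_a, value_b, controls = build_phases(entities, "density", n_control=1)
```

Because the control sentences are identical in both phases, any change in their logprob(A) reflects drift from continued training rather than contradiction, which is what makes them usable as a reference.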
The findings
1. Contradiction masks rather than deletes
After Phase 1 the old values are solidly installed. After Phase 2 contradiction, logprob(old) does not collapse. On Qwen2.5-1.5B the old route stays at roughly −3.2 to −3.7 nats, within ~1–1.5 nats of the control reference (−2.06), while the new route wins by rising from around −4 to −5.5 up to −1.6 to −2.4. Quantified: 0 of 12 contradicted facts erased (erase index mean +1.05 nats, max +2.32). The same protocol on Qwen2.5-3B reproduces it (0/12 erased), and the identical facts on Llama-3.2-3B reproduce the verdict on a second model family (0/18 erased), as does an independent fresh draw of 24 facts (0/18 erased). Masking is not a single-model or single-item-set artifact.
Mechanistically this is expected, and the point is that it is now visible: cross-entropy down-weights the wrong route in proportion to its current probability and, in this regime, does not drive it to a floor. The model installs a new, stronger competing route that out-races the old one, while the old one persists. An alternative the paper cannot rule out is that a short, low-rank fine-tune is simply too weak an intervention to drive logprob(old) to a floor; the stronger-fine-tune check that would separate "fine-tuning masks" from "this fine-tune was too weak to erase" is not run.
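The proportionality can be checked at the level of the output logits alone. The sketch below is a toy, assuming a three-token vocabulary and hand-picked logits; it computes the standard softmax cross-entropy gradient and says nothing about how that signal propagates through a real network or a LoRA adapter.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [x / total for x in exps]

def ce_grad(logits, target):
    """Gradient of -log softmax(logits)[target] with respect to each logit."""
    p = softmax(logits)
    return [p_k - (1.0 if k == target else 0.0) for k, p_k in enumerate(p)]

A, B = 0, 1  # token indices for the old and new value (toy vocabulary)

# Early in contradiction: the old route A still dominates.
before = ce_grad([2.0, 0.0, 0.0], target=B)
# Later: B has overtaken A.
after = ce_grad([0.0, 3.0, 0.0], target=B)

print("push on A:", before[A], "->", after[A])
print("pull on B:", before[B], "->", after[B])
```

The gradient on A's logit equals p(A), so the downward push on the old route shrinks as the old route weakens, while the pull on B has magnitude 1 − p(B) and fades as B approaches certainty. Both effects are consistent with the old route settling at a reduced but non-floor value.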
2. The new route rises in proportion to the surprise
How hard the new route is up-weighted scales with how surprising the correction is. The contradiction surprise (the negative log-probability the model places on value-B at contradiction onset) is monotone in the magnitude of the violation, and per fact the correlation with the final new−old margin is +0.63 on Qwen2.5-1.5B and +0.57 on Qwen2.5-3B (n = 12 each), direction-consistent but not individually significant on the two Llama draws (+0.31, +0.43). At n = 12 the intervals are wide, so these are four same-direction estimates of an expected effect. This is the cross-entropy gradient, proportional to 1 − p(target), made directly visible as error-proportional updating, the same prediction-error scaling described for biological learning (Rescorla–Wagner; dopaminergic reward-prediction-error).
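To make the "intervals are wide" remark concrete, the sketch below computes a Pearson correlation between surprise and final margin and an approximate 95% interval using the Fisher z-transform. The Fisher interval is a standard textbook approximation chosen here for illustration; the source does not say which interval method the paper uses, and the per-fact readings in the demo are invented.

```python
import math

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def fisher_interval(r, n, z_crit=1.96):
    """Approximate confidence interval for a correlation via Fisher's z."""
    z = math.atanh(r)
    half = z_crit / math.sqrt(n - 3)
    return math.tanh(z - half), math.tanh(z + half)

# Hypothetical per-fact readings: logprob(B) at contradiction onset, final margin.
onset_logprob_b = [-4.0, -4.6, -5.1, -5.5]
final_margin = [0.9, 1.4, 1.3, 2.0]
surprise = [-lp for lp in onset_logprob_b]  # surprise = negative log-probability
r_demo = pearson(surprise, final_margin)

# Intervals for the two Qwen correlations reported above, at the stated n = 12.
intervals = {r: fisher_interval(r, n=12) for r in (0.63, 0.57)}
for r, (lo, hi) in intervals.items():
    print(f"r = {r:+.2f}: approx. 95% interval [{lo:+.2f}, {hi:+.2f}]")
```

With only nine effective degrees of freedom in the z standard error, each interval spans most of the positive range, which is why the text treats the four estimates as same-direction evidence rather than as individually decisive.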
3. The diagnostic discriminates: a localized edit deletes
If the readout always reported "mask," it would be uninformative. It does not. Applying a localized constrained fine-tuning edit (FT-L) on GPT-2 XL to 6 real facts, the same readout reports erasure: logprob(old) drops by ~20 nats (range 13.6–29.6) on 6/6 facts, the new route goes near-certain, and 0/12 paraphrase probes bring the old value back. On the erase index the two regimes do not overlap (largest fine-tuning suppression +2.7 nats; smallest edit drop +13.6), a gap of more than 10 nats, so any cutoff between ~3 and ~13 nats yields identical verdicts. The mask-versus-erase readout therefore discriminates. The mask and erase results use different models and fact-types, so the clean dissociation is suggestive of a mechanism effect but is not isolated from those confounds; the confound-free, load-bearing claim is that the readout reports erasure when a real erasure occurs.
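The cutoff-robustness argument amounts to checking that the two sets of erase indices do not overlap. In the sketch below, the two boundary values (+2.7 and +13.6 nats) and the upper end of the edit range (29.6) come from the text; every other number is a hypothetical filler, and `separating_band` is a name introduced here.

```python
def separating_band(mask_indices, edit_indices):
    """Return (low, high) such that any cutoff strictly between them
    separates the two regimes, or None if the regimes overlap."""
    low, high = max(mask_indices), min(edit_indices)
    return (low, high) if low < high else None

def verdicts(indices, cutoff):
    return ["erase" if x >= cutoff else "mask" for x in indices]

# Erase indices in nats: 2.7, 13.6 and 29.6 are reported; the rest are illustrative.
fine_tune = [0.4, 1.05, 2.32, 2.7]
localized_edit = [13.6, 18.0, 21.5, 29.6]

band = separating_band(fine_tune, localized_edit)
print("any cutoff in", band, "gives the same verdicts")

# Sweep integer cutoffs inside the band and confirm the labels never change.
labelings = {
    tuple(verdicts(fine_tune + localized_edit, c)) for c in range(3, 14)
}
```

If the sweep yields a single labelling, the verdict does not depend on where in the gap the cutoff is placed, which is the sense in which the text calls the two regimes non-overlapping.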
4. A stronger prior resists correction (reactance)
Under a Phase-1 stopping-point manipulation of A-strength, facts with a stronger prior reach a smaller final new−old margin. The primary estimates are two per-condition correlations, each over 40 distinct facts: corr(pre-logprob(A), final margin) = −0.47 in weak-A and −0.37 in strong-A. The effect is carried at the per-fact level, not by the coarse condition contrast (whose mean margins barely differ), and it is moderate (r² ≈ 0.22 and 0.14, about a sixth of the margin variance). This is what the cross-entropy mechanism predicts: the gradient down-weights the wrong route in proportion to its current strength, so resistance is a graded function of per-fact prior strength.
5. The old route stays competitive deep into the network
Reading the A-vs-B contest at every layer with a tuned lens shows where the masking resolves. Through the middle of the 36-layer network the two routes are near a dead heat; the new route takes a final lead only in the last third of the stack (B commits at roughly layer 25 of 36). Comparing the depth trajectory before and after contradiction localizes where the new value is written: before, the model computes the old value at 35 of 36 layers; after, the early and middle layers are essentially unchanged while almost all of B's evidence is injected into the last ~10 layers. The new value is installed as a shallow late patch on an intact old-value trunk. Since the LoRA adapter spans every block, the fine-tune could have rewritten the early layers; that it did not is a finding about where the optimization chose to install the correction. The structure replicates on Llama-3.2-3B.
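One way to operationalise "B commits at roughly layer 25" is to take the first layer from which the new-minus-old margin stays positive through the final layer. That definition, the function name, and the per-layer margins below are assumptions for illustration; the source does not state the exact commit criterion, and the sketch assumes the per-layer logprobs have already been read out with a lens.

```python
def commit_layer(margins):
    """Return the 1-indexed layer from which logprob(B) - logprob(A) stays
    positive through the last layer, or None if B does not lead at the end."""
    commit = None
    # Walk backwards from the final layer while B is still ahead.
    for layer in range(len(margins), 0, -1):
        if margins[layer - 1] > 0:
            commit = layer
        else:
            break
    return commit

# Hypothetical 36-layer profile: A leads early, a near dead heat in the middle,
# and B pulls ahead only in the last third.
margins = (
    [-1.0] * 12                              # layers 1-12: old value ahead
    + [0.05, -0.05] * 6                      # layers 13-24: near dead heat
    + [0.5 + 0.2 * i for i in range(12)]     # layers 25-36: new value ahead
)

print(commit_layer(margins))
```

Requiring the lead to be sustained to the final layer avoids calling a commit on brief mid-stack sign flips, which matter in a regime where the two routes are nearly tied.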
6. Masked content returns — and a per-fact prediction reverses
On the masked Llama-3.2-3B model, the matched return tests are run directly. Rapid reacquisition: under one identical budget across three arms, reacquiring the masked old value A reaches recall 0.9 in 4 steps, installing the same facts from a fresh base takes 27 steps, and installing a novel third value from the masked model takes 6 steps. Most of the ~7× acceleration is entity priming; a smaller A-specific increment remains (the masked A is reacquired before a novel value on 12 of 18 facts; Wilcoxon p ≈ .002). Renewal under context shift: the old route recovers a small per-fact margin (+0.16 nats, positive on 14 of 18 facts), never enough to flip generation. The per-fact prediction reverses: the pre-stated "more masked = readier return" comes out with the opposite sign (corr(erase index, one-step relearn recovery) = +0.67), consistent with error-proportional updating operating in reverse. The reversal is reported as found.
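The reacquisition comparison reduces to a steps-to-threshold measurement on each arm followed by simple ratios. In the sketch below, the step counts (4, 27, 6) and the 0.9 recall threshold are the reported figures; the recall curve and the helper name are hypothetical.

```python
def steps_to_recall(recall_curve, threshold=0.9):
    """First 1-indexed training step at which recall reaches the threshold."""
    for step, recall in enumerate(recall_curve, start=1):
        if recall >= threshold:
            return step
    return None  # threshold never reached within the budget

# Hypothetical recall curve for the reacquisition arm.
reacquire_curve = [0.30, 0.60, 0.80, 0.95]
demo_steps = steps_to_recall(reacquire_curve)

# Reported steps to recall 0.9 for the three arms.
steps = {"reacquire_A": 4, "cold_install": 27, "novel_value_from_masked": 6}

speedup_total = steps["cold_install"] / steps["reacquire_A"]               # 6.75
speedup_priming = steps["cold_install"] / steps["novel_value_from_masked"]  # 4.5
a_specific = steps["novel_value_from_masked"] / steps["reacquire_A"]        # 1.5

print(speedup_total, speedup_priming, a_specific)
```

Read this way, 27/4 = 6.75 is the headline ~7× figure, 27/6 = 4.5 is the part a novel value also enjoys (entity priming), and 6/4 = 1.5 is the smaller remainder specific to the masked value A.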
Unification: masking across substrates
The masking result is not new as a phenomenon; it is new as a direct measurement. In behavioural neuroscience, Bouton (2014) documents that behaviour change is not erasure: extinction, counterconditioning, and punishment inhibit rather than delete the original behaviour, the inhibition is context-specific, and the original route returns via renewal, reinstatement, spontaneous recovery, and rapid reacquisition. The parallel is close: contradictory fine-tuning plays the role of extinction-by-new-learning; the persisting logprob(old) is the inhibited-not-erased original route; jailbreak and relearning rebound is renewal and reinstatement. The difference, and the language model's contribution, is observability. In animals and humans, masking can only be inferred from behaviour; in the language model it is read directly off the logits. The same logic explains why the safety field keeps finding that "unlearned" knowledge persists. The language model serves as a transparent measurement model for a process other substrates expose only behaviourally.
Connections to other papers in the series
- Paper 2B (ICL vs FT memory) — the competing-routes substrate and its friction readout this paper measures across training.
- Paper 30 (Installable fields) — what fine-tuning installs and how it reshapes the route landscape; the constructive companion to this paper's masking result.
- Paper 4B (Substrates encode experience) — encoding-through-loading; this paper reads the per-fact encoding trajectory under contradiction.
- Paper 1 (Friction Theory) — the substrate-universal race framework whose route-competition axioms the mask-versus-erase readout instantiates.
Read the paper
The full paper is on Zenodo (concept DOI 10.5281/zenodo.20570433):