Fine-Tuning and In-Context Learning Install Dispositions, Not Data
Paper 4C · Pødenphant Lund (2026)
Practitioners reach for fine-tuning or in-context demonstration to make a model behave a certain way: check premises, abstain on a false assumption, commit only when warranted. This paper asks a prior question — what can such an intervention actually install? The finding is a division of labour. Both routes install a disposition (a tendency that reshapes how the model routes its answers); in the templates and tasks tested, neither installs correctness. Forcing an incompatible response template through fine-tuning destroys the model's answer-delivery behaviour while leaving its knowledge largely in place.
| DOI (concept) | 10.5281/zenodo.20562086 |
| Status | Live on Zenodo (2026-06-17) |
| Author | Tomas Pødenphant Lund |
TL;DR
The paper frames every intervention as sculpting the model's route-landscape: the terrain of competing routes that incoming input flows through, where an instruction is the terrain and data is the flow. The two interventions studied — fine-tuning on instruction-style data and in-context demonstration — install an instruction (a disposition that reshapes routing). In these tests they do not install correctness: the behavioural-instruction route moves dispositions without moving out-of-distribution reasoning accuracy, and forcing an incompatible response template destroys answer-delivery while leaving knowledge largely in place.
The boundary is stated at exactly that scope. Compatible-template fine-tuning on a task's own data, which might lift in-distribution accuracy, is a different operation that the design does not test. Dedicated knowledge-editing, which grafts facts directly, is a different mechanism and is not equated with these two.
The route-landscape frame
Every intervention is read as a way of sculpting the model's route-landscape. The frame or instruction is the terrain; its basins are the routes a model can take to an answer; incoming input is the flow that runs through the terrain. Friction is the contest between basins for where the flow settles. It is read as competing routes: the alternative next-token continuations the model weighs at a choice point. Raw data without a directing frame carves no basins and starts no races. An instruction is what turns data into competing routes.
The landscape is used as an organizing lens, not as the paper's evidence. Every empirical claim is a direct measurement in standard terms — accuracy and commit-rate under a fixed decoding protocol, false-premise rejection and abstention rates, a paired McNemar test, and a closed-form crossover fit to measured trade-offs. Readers who prefer neutral vocabulary can read "incompatible-template fine-tuning degrades general capability" for "the basin over-deepens into a canyon" throughout.
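For readers less familiar with the paired test named above, here is a minimal sketch of an exact (binomial) McNemar test on per-item correctness flags. It is an illustration only: the function name `mcnemar_exact` and the toy inputs are assumptions made for this example, and the summary does not say whether the paper uses the exact or the chi-square variant.

```python
from math import comb


def mcnemar_exact(base_correct, treated_correct):
    """Two-sided exact McNemar test on paired per-item correctness flags.

    Hypothetical helper for illustration; not taken from the paper.
    Returns (b, c, p_value), where b counts items only the baseline got
    right and c counts items only the treated condition got right.
    """
    if len(base_correct) != len(treated_correct):
        raise ValueError("paired lists must have equal length")
    # Only discordant pairs carry information about a difference.
    b = sum(1 for x, y in zip(base_correct, treated_correct) if x and not y)
    c = sum(1 for x, y in zip(base_correct, treated_correct) if y and not x)
    n = b + c
    if n == 0:
        return b, c, 1.0
    k = min(b, c)
    # Under the null, each discordant pair is a fair coin flip: Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Toy data: five items the baseline answers correctly and the treated run misses.
b, c, p = mcnemar_exact([True] * 5, [False] * 5)
print(b, c, p)
```

The test ignores items on which both conditions agree, which is why it suits same-items comparisons such as a vanilla model against its fine-tuned or in-context variant.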
The primary contributions
Template-compatibility, not data volume, decides the fine-tuning outcome at 70B scale
Fine-tuning Llama-3.3-70B on 540 examples of premise-checking (PCHECK, a short verdict-plus-note template) and, separately, 540 examples of a verbose forward-framing template — same substrate, same data scale, same LoRA configuration — gives opposite outcomes. On the full benchmarks, truncation-free and matched-precision, PCHECK preserves general capability (MMLU 82.3% versus a vanilla 84.5%, n = 14,042) while forward-framing collapses it (8.2% at the same dose). The controlling variable is the response template's compatibility with the model's natural output format, not the amount of training data: PCHECK absorbs 24× the dose with only a small cost, while forward-framing falls off a cliff between 90 and 120 examples.
What the incompatible template destroys is behaviour, not knowledge
The natural reading of "MMLU 8%" is that the model's knowledge is gone. Format-invariant scoring says otherwise. Re-scored with a forced-choice logprob readout, the collapsed dose-270 and dose-540 models both reach 78.3% against the vanilla model's 81.4% under the identical readout, and 84.8% on a convergent option-mention readout against a 25% floor. The fine-tune has overwritten the model's ability to deliver an answer in the task's free-generation format — only ~16% of responses emit an answer letter at all — while the underlying knowledge survives at near-vanilla level. The canyon, precisely stated, is behavioural displacement: an instruction-route intervention damages instruction-routing, not stored facts.
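The contrast between free-generation scoring and a forced-choice readout can be sketched as follows. This is a hedged illustration, not the paper's evaluation code: the answer-parsing regex, the function names, and the two toy items (including their log-probabilities) are all invented for the example.

```python
import re

LETTERS = ("A", "B", "C", "D")


def parse_free_generation(text):
    """Return the answer letter if the response states one, else None.

    Hypothetical parser: looks for 'answer is X' or 'Answer: (X)'.
    """
    match = re.search(r"[Aa]nswer\s*(?:is|:)?\s*\(?([A-D])\b", text)
    return match.group(1) if match else None


def forced_choice(option_logprobs):
    """Pick the option with the highest log-probability, ignoring output format."""
    return max(LETTERS, key=lambda letter: option_logprobs[letter])


def score(items):
    """Score the same items under both readouts."""
    free_hits = forced_hits = emitted = 0
    for item in items:
        letter = parse_free_generation(item["response"])
        emitted += letter is not None
        free_hits += letter == item["gold"]
        forced_hits += forced_choice(item["logprobs"]) == item["gold"]
    n = len(items)
    return {
        "free_accuracy": free_hits / n,
        "emit_rate": emitted / n,
        "forced_accuracy": forced_hits / n,
    }


# Toy items (made up): the first response never emits a letter, yet the
# option log-probabilities still favour the correct answer.
items = [
    {"gold": "B",
     "response": "Looking ahead, this question invites a broader framing.",
     "logprobs": {"A": -2.3, "B": -0.4, "C": -2.9, "D": -3.1}},
    {"gold": "C",
     "response": "The answer is C.",
     "logprobs": {"A": -3.0, "B": -2.2, "C": -0.3, "D": -2.8}},
]
report = score(items)
print(report)
```

In this toy case free-generation accuracy drops because no letter is emitted, while the forced-choice readout is unaffected, which is the pattern the section describes as behavioural displacement.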
A truncation artifact that manufactures phantom accuracy
On a reasoning benchmark the generation token cap is not a neutral parameter. Under a small cap, a model taught a short answer-first template commits within budget, while an open-reasoning condition is cut off before it emits a parseable answer, so scoring rewards the format rather than the reasoning. The paper's own earlier measurement showed a +10.1pp PCHECK ICL "lift" on GPQA Diamond that vanished under a force-commit protocol with a generous cap; the entire lift was a commit-rate effect. The cheap diagnostic is to report commit-rate beside accuracy.
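A minimal sketch of that diagnostic, assuming each prediction is either a parsed answer or `None` when the response was cut off or otherwise unparseable. The function name `commit_report` and the toy data are hypothetical and are not drawn from the paper's results.

```python
def commit_report(predictions, gold):
    """Report accuracy together with commit-rate.

    predictions: parsed answers, with None where no parseable answer was emitted.
    gold: reference answers, same length and order.
    """
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have equal length")
    n = len(gold)
    committed = [(p, g) for p, g in zip(predictions, gold) if p is not None]
    correct = sum(p == g for p, g in committed)
    return {
        "accuracy": correct / n,  # uncommitted items count as wrong
        "commit_rate": len(committed) / n,
        "accuracy_given_commit": correct / len(committed) if committed else None,
    }


# Toy data (made up): both conditions are right half the time when they
# commit, but the open-reasoning run is truncated on four of ten items.
gold = ["A"] * 10
short_template = ["A"] * 5 + ["B"] * 5
open_reasoning = ["A"] * 3 + ["B"] * 3 + [None] * 4

short_report = commit_report(short_template, gold)
open_report = commit_report(open_reasoning, gold)
print(short_report)
print(open_report)
```

Here the headline accuracy gap between the two toy conditions comes entirely from the difference in commit-rate, since accuracy among committed answers is identical.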
What in-context premise-checking installs
Measured truncation-free, 70-shot PCHECK ICL gives no accuracy gain on any of three substrates (Llama-3.3-70B −3.0pp, gpt-4o-mini +0.9pp, gpt-4o −3.6pp on GPQA Diamond, n=198). What it genuinely does is install caution: it sharply raises abstention and false-premise rejection (gpt-4o abstention 8% to 26%). It is a task-dependent caution disposition, useful where false premises exist and costly where they do not. The fine-tuned version does not reach an accuracy the overlay cannot — the two are accuracy-equivalent and fail on the same hard items, so the static-versus-dynamic distinction is about durability, not correctness.
When the doubt-disposition pays
Because the disposition is caution, its value is task-dependent, and the paper makes that quantitative across five substrates. The benefit is saturated: modern instruct models already reject false premises at 89–100%, even deceptive technical ones, so installing more caution cannot improve detection. Its effect is instead the capacity-gated cost of over-rejecting valid items, indiscriminate on small models and discriminative on strong ones. A closed-form crossover, f* = |Δa| / (Δr(1+k) + |Δa|), gives the false-premise density above which the disposition pays. It is offered as a descriptive model of the measured trade-offs — one held-out consistency check succeeds (gpt-4o) and one does not cleanly replicate, and the bootstrap intervals are wide — so what it delivers is the capacity-ordering and a way to reason about deployment, not a validated predictive constant.
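The crossover can be evaluated directly. This summary does not define the symbols, so the sketch below rests on an assumed reading: Δa is the accuracy change on valid items (negative when the disposition over-rejects), Δr is the gain in false-premise rejection, and k is an extra weight on each false premise caught. Under that reading the formula is the break-even density at which f·Δr·(1+k) equals (1−f)·|Δa|. The input values are illustrative, not the paper's measurements.

```python
def crossover_density(delta_a, delta_r, k):
    """Evaluate f* = |delta_a| / (delta_r * (1 + k) + |delta_a|).

    Assumed interpretation (not confirmed by the source summary):
      delta_a: accuracy change on valid items caused by the disposition
      delta_r: gain in false-premise rejection rate
      k:       extra weight placed on a caught false premise
    """
    cost = abs(delta_a)
    gain = delta_r * (1 + k)
    if gain + cost <= 0:
        raise ValueError("denominator must be positive")
    return cost / (gain + cost)


# Illustrative numbers only: a 5-point cost on valid items, a 10-point
# rejection gain, and k = 1.
f_star = crossover_density(delta_a=-0.05, delta_r=0.10, k=1.0)
print(round(f_star, 3))
```

With the cost held fixed, the same expression moves toward 1 as Δr shrinks, which fits the observation that a saturated rejection rate leaves little room for the disposition to pay.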
The saturation has a falsifiable edge. On this account a base (pre-RLHF) model, whose competing routes have not been suppressed by alignment tuning, should have a lower baseline false-premise rejection rate and therefore real detection head-room — the head-room the installed disposition needs. The matched test (PCHECK on a base model) is the decisive open falsifier.
The boundary, and what it does not claim
The paper does not claim that fine-tuning is a bad idea or that in-context learning replaces it. Each installs instructions; neither installs correctness within the tested scope. It does not claim that no method can install facts: dedicated knowledge-editing (ROME, MEMIT) grafts facts directly into the weights and is a different mechanism. An illustrative contrast shows a localized weight-edit installing 28 counterfactual facts on GPT-2-XL cleanly (efficacy 100%, capability intact), so the canyon is read as mechanism mismatch — pushing data-as-truth through a route built to install instructions — not the price of installing information.
Connections to other papers in the series
- Paper 1 (Friction Theory) — the substrate-universal framework whose competing-routes reading of friction the route-landscape frame builds on.
- Paper 2B (ICL vs FT memory) — in-context learning as working memory, fine-tuning as long-term memory; here the static-versus-dynamic landscapes are the durability side of the same distinction.
- Paper 30 (Installable fields) — installing a function via fine-tuning; the companion question of what a fine-tune can and cannot put into a substrate.
- Capacity scaling — the capacity-gated cost of the doubt-disposition is the same discriminative-capacity gradient.
Read the paper
The full paper is on Zenodo (concept DOI 10.5281/zenodo.20562086).