Fine-Tuning and In-Context Learning Install Dispositions, Not Data

Paper 4C · Pødenphant Lund (2026)

Practitioners reach for fine-tuning or in-context demonstration to make a model behave a certain way: check premises, abstain on a false assumption, commit only when warranted. This paper asks a prior question — what can such an intervention actually install? The finding is a division of labour. Both routes install a disposition (a tendency that reshapes how the model routes its answers); in the templates and tasks tested, neither installs correctness. Forcing an incompatible response template through fine-tuning destroys the model's answer-delivery behaviour while leaving its knowledge largely in place.

DOI (concept): 10.5281/zenodo.20562086
Status: Live on Zenodo (2026-06-17)
Author: Tomas Pødenphant Lund

TL;DR

The paper frames every intervention as sculpting the model's route-landscape: the terrain of competing routes that incoming input flows through, where an instruction is the terrain and data is the flow. The two interventions studied — fine-tuning on instruction-style data and in-context demonstration — install an instruction (a disposition that reshapes routing). In these tests they do not install correctness: the behavioural-instruction route moves dispositions without moving out-of-distribution reasoning accuracy, and forcing an incompatible response template destroys answer-delivery while leaving knowledge largely in place.

The boundary is stated at exactly that scope. Whether compatible-template fine-tuning on a task's own data can lift in-distribution accuracy is a separate question that the design does not test. Dedicated knowledge-editing, which grafts facts directly, is a different mechanism, and the paper does not equate it with these two.

The route-landscape frame

Every intervention is read as a way of sculpting the model's route-landscape. The frame or instruction is the terrain; its basins are the routes a model can take to an answer; incoming input is the flow that runs through the terrain; and friction is the contest between basins for where the flow settles, read as competing routes — the alternative next-token continuations the model weighs at a choice point. Raw data without a directing frame carves no basins and starts no races. An instruction is what turns data into competing routes.

The landscape is used as an organizing lens, not as the paper's evidence. Every empirical claim is a direct measurement in standard terms — accuracy and commit-rate under a fixed decoding protocol, false-premise rejection and abstention rates, a paired McNemar test, and a closed-form crossover fit to measured trade-offs. Readers who prefer neutral vocabulary can read "incompatible-template fine-tuning degrades general capability" for "the basin over-deepens into a canyon" throughout.
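One of the standard measurements listed above is a paired McNemar test. As a minimal sketch of what such a test computes, assuming the exact (binomial) variant and two lists of per-item correctness for the same items under two conditions, the following may help; the function names are illustrative, and this summary does not say which variant the paper uses.

```python
from math import comb

def discordant_counts(correct_a, correct_b):
    """Count items where exactly one of the two paired conditions is correct."""
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if not x and y)
    return b, c

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the discordant counts b and c."""
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    # Under the null, each discordant pair is a fair coin flip: Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical per-item results for two conditions on the same four items.
b, c = discordant_counts([True, True, False, False], [True, False, True, False])
p_value = mcnemar_exact(b, c)
```

Only the discordant items carry information here, which is why two conditions that fail on the same items can be statistically indistinguishable even when their raw accuracies differ slightly.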

The primary contributions

Template-compatibility, not data volume, decides the fine-tuning outcome at 70B scale

Fine-tuning Llama-3.3-70B on 540 examples of premise-checking (PCHECK, a short verdict-plus-note template) and, separately, 540 examples of a verbose forward-framing template — same substrate, same data scale, same LoRA configuration — gives opposite outcomes. On the full benchmarks, truncation-free and matched-precision, PCHECK preserves general capability (MMLU 82.3% versus a vanilla 84.5%, n = 14,042) while forward-framing collapses it (8.2% at the same dose). The controlling variable is the response template's compatibility with the model's natural output format, not the amount of training data: PCHECK absorbs 24× the dose with only a small cost, while forward-framing falls off a cliff between 90 and 120 examples.

What the incompatible template destroys is behaviour, not knowledge

The natural reading of "MMLU 8%" is that the model's knowledge is gone. Format-invariant scoring says otherwise. Re-scored with a forced-choice logprob readout, the collapsed dose-270 and dose-540 models both reach 78.3% against the vanilla model's 81.4% under the identical readout, and 84.8% on a convergent option-mention readout against a 25% floor. The fine-tune has overwritten the model's ability to deliver an answer in the task's free-generation format — only ~16% of responses emit an answer letter at all — while the underlying knowledge survives at near-vanilla level. The canyon, precisely stated, is behavioural displacement: an instruction-route intervention damages instruction-routing, not stored facts.
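The difference between free-generation scoring and a forced-choice logprob readout can be sketched as follows. This is an illustration under stated assumptions, not the paper's evaluation code: the item layout (`gold`, `logprobs`, `generation`), the answer-extraction regex, and the toy items are all hypothetical.

```python
import re
from typing import Dict, List, Optional

# Simplistic parser (an assumption): looks for "answer is X" or "Answer: (X)".
ANSWER_RE = re.compile(r"[Aa]nswer\s*(?:is|:)?\s*\(?([ABCD])\)?\b")

def forced_choice_pick(option_logprobs: Dict[str, float]) -> str:
    """Pick the option letter with the highest log-probability."""
    return max(option_logprobs, key=option_logprobs.get)

def free_generation_pick(text: str) -> Optional[str]:
    """Return the answer letter found in generated text, or None if absent."""
    m = ANSWER_RE.search(text)
    return m.group(1) if m else None

def score(items: List[dict]) -> Dict[str, float]:
    """Score the same items under both readouts, plus the letter-emission rate."""
    n = len(items)
    picks = [free_generation_pick(it["generation"]) for it in items]
    return {
        "forced_choice_acc": sum(forced_choice_pick(it["logprobs"]) == it["gold"] for it in items) / n,
        "free_gen_acc": sum(p == it["gold"] for p, it in zip(picks, items)) / n,
        "emit_rate": sum(p is not None for p in picks) / n,
    }

# Toy items: the first "knows" the answer but never states a letter.
items = [
    {"gold": "B", "logprobs": {"A": -2.0, "B": -0.3, "C": -3.0, "D": -4.0},
     "generation": "Looking forward, we should consider the wider context."},
    {"gold": "C", "logprobs": {"A": -1.0, "B": -1.5, "C": -0.2, "D": -3.0},
     "generation": "The answer is C."},
]
report = score(items)
```

In the toy data the forced-choice accuracy stays high while free-generation accuracy tracks the emission rate, which is the pattern the paragraph above describes at much larger scale.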

A truncation artifact that manufactures phantom accuracy

On a reasoning benchmark the generation token cap is not a neutral parameter. Under a small cap, a model taught a short answer-first template commits within budget while an open-reasoning condition is cut off before it emits a parseable answer, so scoring rewards the format rather than the reasoning. The paper's own earlier measurement showed a +10.1pp PCHECK ICL "lift" on GPQA Diamond that vanished under a force-commit protocol with a generous cap; the entire lift was commit-rate. The cheap diagnostic is to report commit-rate beside accuracy.
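A minimal sketch of that diagnostic, assuming predictions have already been parsed and that `None` marks a response cut off before a parseable answer, could look like this. The function name and the toy data are illustrative, not taken from the paper.

```python
def commit_report(predictions, gold):
    """Report accuracy together with commit-rate for one condition.

    predictions: parsed answers, with None where no answer was emitted.
    gold: reference answers, same length and order.
    """
    n = len(gold)
    committed = [(p, g) for p, g in zip(predictions, gold) if p is not None]
    correct = sum(p == g for p, g in committed)
    return {
        "accuracy": correct / n,                # uncommitted items count as wrong
        "commit_rate": len(committed) / n,
        "accuracy_given_commit": correct / len(committed) if committed else float("nan"),
    }

gold = ["A", "B", "C", "D"]
# Made-up data: a short answer-first template always commits within the cap,
# while open reasoning is truncated on three of four items.
short_template = commit_report(["A", "B", "A", "A"], gold)
open_reasoning = commit_report(["A", None, None, None], gold)
```

In this toy case the short template appears to gain 25 points of accuracy, but the gap comes entirely from the difference in commit-rate, which is the signature of the artifact.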

What in-context premise-checking installs

Measured truncation-free, 70-shot PCHECK ICL gives no accuracy gain on any of three substrates (Llama-3.3-70B −3.0pp, gpt-4o-mini +0.9pp, gpt-4o −3.6pp on GPQA Diamond, n = 198). What it does install is caution: it sharply raises abstention and false-premise rejection (gpt-4o abstention rises from 8% to 26%). This is a task-dependent disposition, useful where false premises exist and costly where they do not. The fine-tuned version reaches no accuracy that the in-context overlay cannot: the two are accuracy-equivalent and fail on the same hard items, so the static-versus-dynamic distinction is about durability, not correctness.

When the doubt-disposition pays

Because the disposition is caution, its value is task-dependent, and the paper makes that quantitative across five substrates. The benefit is saturated: modern instruct models already reject false premises at 89–100%, even deceptive technical ones, so installing more caution cannot improve detection. Its effect is instead the capacity-gated cost of over-rejecting valid items, indiscriminate on small models and discriminative on strong ones. A closed-form crossover, f* = |Δa| / (Δr(1+k) + |Δa|), gives the false-premise density above which the disposition pays. It is offered as a descriptive model of the measured trade-offs — one held-out consistency check succeeds (gpt-4o) and one does not cleanly replicate, and the bootstrap intervals are wide — so what it delivers is the capacity-ordering and a way to reason about deployment, not a validated predictive constant.
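The crossover formula above is simple enough to evaluate directly. The sketch below treats Δa, Δr and k as plain numeric inputs, because this summary does not define them; the function names and the example numbers are made up for illustration and are not results from the paper.

```python
def crossover_density(delta_a: float, delta_r: float, k: float) -> float:
    """Evaluate f* = |Δa| / (Δr(1+k) + |Δa|) as written in the text.

    The meaning of each symbol is not defined in this summary, so the
    arguments are passed through without interpretation.
    """
    numerator = abs(delta_a)
    denominator = delta_r * (1 + k) + numerator
    if denominator <= 0:
        raise ValueError("denominator must be positive for f* to be defined")
    return numerator / denominator

def disposition_pays(false_premise_density: float, f_star: float) -> bool:
    """The disposition pays when the false-premise density exceeds f*."""
    return false_premise_density > f_star

# Made-up inputs, for arithmetic illustration only.
f_star = crossover_density(delta_a=-0.05, delta_r=0.10, k=1.0)
pays_at_30_percent = disposition_pays(0.30, f_star)
pays_at_10_percent = disposition_pays(0.10, f_star)
```

With these inputs the threshold is 0.05 / (0.10 × 2 + 0.05) = 0.2, so a task with 30% false premises falls above the crossover and one with 10% falls below it. Given the wide bootstrap intervals noted above, such a threshold is better read as a rough ordering aid than as a precise cut-off.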

The saturation has a falsifiable edge. On this account a base (pre-RLHF) model, whose competing routes have not been suppressed by alignment tuning, should have a lower baseline false-premise rejection rate and therefore real detection head-room — the head-room the installed disposition needs. The matched test (PCHECK on a base model) is the decisive open falsifier.

The boundary, and what it does not claim

The paper does not claim that fine-tuning is a bad idea or that in-context learning replaces it. Each installs instructions; neither installs correctness within the tested scope. It does not claim that no method can install facts: dedicated knowledge-editing (ROME, MEMIT) grafts facts directly into the weights and is a different mechanism. An illustrative contrast shows a localized weight-edit installing 28 counterfactual facts on GPT-2-XL cleanly (efficacy 100%, capability intact), so the canyon is read as mechanism mismatch — pushing data-as-truth through a route built to install instructions — not the price of installing information.

Read the paper

The full paper is on Zenodo (concept DOI 10.5281/zenodo.20562086):

Pødenphant Lund, T. (2026). Fine-Tuning and In-Context Learning Install Dispositions, Not Data. Zenodo. https://doi.org/10.5281/zenodo.20562086
