Installing reasoning pathways: fine-tune or prompt?

A practical playbook for installing a reasoning behaviour into an LLM without breaking it

The one-line finding: what decides whether a fine-tune succeeds is template-compatibility (whether the response style you are teaching fights the model's natural one), not how much data you have. A short-template intervention rode a 24× dose increase untouched; a verbose-template one collapsed a model's general capability from 86% to 3% on the same data scale and recipe.

Fine-tuning to install a cognitive pathway (calibration, premise-checking, verify-before-commit) is expensive and, more importantly, risky in a way that is invisible until you measure baseline capability. A fine-tune can look fine on its target task while having quietly destroyed the model's general reasoning. A failed 70B run costs roughly $5–15 in compute plus deployment overhead, and the damage is permanent until you retrain. Worse: the damage scales with model size. A bigger model is not a safer fine-tuning target for this failure mode: it has more capacity to react badly.

This page is the practical companion to Paper 4C (Lund 2026, in preparation). It gives you a decision rule, a design framework, a free pre-flight check, and the empirical tables behind them.

1. The decision rule

For most teams, most of the time, at modern scale, the default is: try in-context learning (ICL) first. Fine-tuning earns its place only when ICL underperforms on the specific task and the response template you need is compatible with the model's natural output style.

The asymmetry that drives all of this: a free smoke test plus a $0.05 ICL evaluation can save you a $5–15 fine-tune failure, and ICL has no failure mode that destroys baseline capability.
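The rule above can be written down as a small function. This is a minimal sketch, not the paper's tooling: the `Candidate` fields and the `choose_method` name are hypothetical, and the third branch (ICL underperforms and the template is incompatible) is an assumption, since the rule only says fine-tuning is not warranted in that case.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Inputs to the decision rule (hypothetical field names)."""
    icl_meets_target: bool     # did the cheap ICL evaluation reach the accuracy you need?
    template_compatible: bool  # does the taught response style match the model's natural one?


def choose_method(c: Candidate) -> str:
    """Return the approach suggested by the decision rule."""
    if c.icl_meets_target:
        # Default: ICL first, because it cannot damage baseline capability.
        return "icl"
    if c.template_compatible:
        # Fine-tuning earns its place only when ICL underperforms
        # and the template fits the model's natural output style.
        return "fine-tune"
    # Assumption: neither condition holds, so rework the template
    # (or stay with ICL) rather than risk a baseline-destroying run.
    return "redesign-template"
```

For example, `choose_method(Candidate(icl_meets_target=False, template_compatible=True))` returns `"fine-tune"`.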

2. Template-compatibility decides the outcome

We ran two interventions, PCHECK (short template) and forward-framing (verbose template), through identical dose-response curves on Llama-3.3-70B with the same LoRA configuration. They differ only in the response template the training data teaches.

Same data scale. Same substrate. Same recipe. Opposite outcomes (MMLU is the general-capability benchmark; it drops first when an intervention breaks baseline reasoning):

| Examples | PCHECK MMLU | Forward-framing MMLU | PCHECK GPQA | Forward-framing GPQA |
| --- | --- | --- | --- | --- |
| 90 | 83% | 40% | 40% | 33% |
| 270 | 83% | 10% | 40% | 7% |
| 540 | 87% | 3% | 43% | 13% |

PCHECK at 540 examples: baseline intact, target task lifted. Forward-framing at the same 540 examples: the model is destroyed. To rule out "PCHECK 90 was a lucky sweet spot," we pushed PCHECK to 24× the baseline dose (2160 examples). It held a flat plateau the whole way: MMLU 83–87%, no collapse at any dose.
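One way to turn the MMLU columns in this section into an automatic guard is to flag the first dose at which general capability falls too far below the pre-fine-tune figure. The sketch below is illustrative only: the 86% baseline is read from the "86% to 3%" statement at the top of this page, the 10-point tolerance is an arbitrary hypothetical threshold, and the function name is invented for this example.

```python
from typing import Mapping, Optional

# Assumption: 86% is the pre-fine-tune MMLU figure quoted at the top of the page.
BASELINE_MMLU = 86.0
# Hypothetical tolerance: how many points of MMLU you are willing to lose.
MAX_DROP_PP = 10.0

# MMLU accuracy (%) keyed by number of training examples, from this section.
PCHECK_MMLU = {90: 83.0, 270: 83.0, 540: 87.0}
FORWARD_FRAMING_MMLU = {90: 40.0, 270: 10.0, 540: 3.0}


def first_collapse_dose(
    curve: Mapping[int, float],
    baseline: float = BASELINE_MMLU,
    max_drop_pp: float = MAX_DROP_PP,
) -> Optional[int]:
    """Return the smallest dose whose MMLU drop exceeds the tolerance, else None."""
    for dose in sorted(curve):
        if baseline - curve[dose] > max_drop_pp:
            return dose
    return None
```

Under these assumptions the PCHECK curve never trips the guard, while the forward-framing curve trips it at the smallest dose measured.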

The reframe

"More data made it worse" is not a universal property of high-dose fine-tuning. It is a property of incompatible-template fine-tuning at high dose. When the template fits the model's natural style, dose is harmless. When it doesn't, dose is an accelerant. So the first question is never "how much data?" It is "does the response style I'm teaching fight the model's natural one?"

3. The inverse-U: bigger isn't safer

The same recipe that helps a mid-tier model can hurt a frontier one. We took the best free-signal recipe (a 70-example in-context PCHECK pool, plus logprob-recovery when the text answer doesn't parse) and ran it on GPQA Diamond at a robust sample size (n=198) across two model families:

| Substrate | Vanilla | + recipe | Change | Regime |
| --- | --- | --- | --- | --- |
| Llama-3.3-70B | 25.8% | 33.3% | +7.6pp | mid-tier: helps |
| gpt-4o-mini | 30.3% | 34.8% | +4.5pp | mid-tier: helps |
| gpt-4o | 42.9% | 41.4% | −1.5pp | frontier: reversed |

Read top to bottom: as the substrate's baseline capability rises (25.8% → 30.3% → 42.9%), the recipe's benefit shrinks and finally turns negative. The frontier model is hurt by the same prompt that helps mid-tier ones: it already does premise-checking, so the demonstration is noise. This is the expertise-reversal effect (Kalyuga et al. 2003): instructional supports that help novices hurt experts, here reproduced across substrate capacity.

This is not an isolated result. The companion paper on capacity (Paper 2D) measures the same inverted-U through a completely different signal (the effect of in-context framing on the model's internal token-level competition) and finds the same shape, which it calls "Yerkes-Dodson on frame-effectiveness." Two paradigms, one curve. See the substrate-graded expertise-reversal discussion for the broader pattern.

4. The race-start framework (three qualities)

The race-start metaphor (from the ICL-as-working-memory paper and the broader series) says that an installed pathway sets where the race starts and how broad the starting direction is. Our data show this is not one dimension but three, each of which must be calibrated to the substrate. PCHECK is the only intervention we tested that lands well on all three at once.

| Quality | Fails if too low | Fails if too high |
| --- | --- | --- |
| Breadth | One narrow direction activates too little of a large model's route-space | Too many disparate directions overload the model; signal goes diffuse |
| Task-relevance | Generic priming teaches no transferable skill | Over-specialised prompt generalises narrowly |
| Template-compatibility | (no real failure here) | Verbose multi-section template fights natural style → reactance, baseline destruction |

Template-compatibility is the quality with the strongest direct evidence (the counterfactual above) and the one engineers most often get wrong, because a richly-structured template looks like more thorough training data. At scale it is poison.

A corollary: episode count is a multiplier of fit-quality, not an independent lever. With an aligned intervention, more examples help; with a misaligned one, more examples fail to help or actively harm. One well-aligned example teaches more than a hundred misaligned ones.

5. Anti-patterns to avoid

Each of these is a real failure we observed, not a hypothetical:

6. The ICL deployment pattern

When ICL is the answer, build the pool to mirror the three winning qualities: task-aligned (the examples should be the task, not a meta-frame around it), mixed content at moderate breadth (one coherent frame with internal variety, not a grab-bag), and natural in style (don't smuggle a heavy template in through the examples).

In-context learning with a well-built 70-example pool matched or beat every fine-tune we tried, at roughly $0.05 per evaluation instead of $5–15 per fine-tune, with zero baseline damage and full per-query reversibility. At high shot counts the model's general capability actually rose above baseline: the opposite of same-scale fine-tuning.
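As an illustration of the "natural in style" advice, a pool can be turned into a prompt with nothing more than plain question/answer pairs. This is a sketch under assumptions: the `question`/`answer` keys, the `Question:`/`Answer:` layout, and the `build_icl_prompt` name are hypothetical, not the format used in the paper.

```python
from typing import Mapping, Sequence


def build_icl_prompt(
    pool: Sequence[Mapping[str, str]],
    question: str,
    max_shots: int = 70,
) -> str:
    """Assemble a few-shot prompt from a pool of worked examples.

    Each example is shown as a bare question/answer pair, so the
    demonstrations do not introduce a heavy multi-section template.
    """
    shots = pool[:max_shots]
    blocks = [f"Question: {ex['question']}\nAnswer: {ex['answer']}" for ex in shots]
    # The final block is the real query, left open for the model to complete.
    blocks.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(blocks)
```

Selecting which examples go into the pool (task-aligned, one coherent frame with internal variety) is left to the caller; the function only fixes the layout.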

Logprob-recovery is a free companion signal. When the model's text answer can't be parsed cleanly, fall back to the most probable answer-option from the token logprobs. This recovers a few points of accuracy at zero extra inference cost. Always request logprobs so the fallback is available.
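The fallback described above might look like the following. This is a minimal sketch, assuming a four-option multiple-choice task and assuming the logprobs for the answer position arrive as a plain mapping from token string to log-probability; the regular expression, the `recover_answer` name, and that input shape are all illustrative choices rather than any specific API's format.

```python
import re
from typing import Mapping, Optional, Tuple

OPTIONS = ("A", "B", "C", "D")

# Matches phrasings such as "Answer: C" or "The answer is (B)".
ANSWER_RE = re.compile(r"[Aa]nswer\s*(?:is|:)?\s*\(?([A-D])\)?\b")


def parse_answer(text: str) -> Optional[str]:
    """Return the option letter stated in the text, or None if it does not parse."""
    match = ANSWER_RE.search(text)
    return match.group(1) if match else None


def recover_answer(
    text: str, option_logprobs: Mapping[str, float]
) -> Tuple[Optional[str], str]:
    """Prefer the parsed text answer; otherwise fall back to the likeliest option token."""
    parsed = parse_answer(text)
    if parsed is not None:
        return parsed, "text"

    # Keep the best logprob per option; tokens may carry leading whitespace.
    best: dict = {}
    for token, logprob in option_logprobs.items():
        key = token.strip()
        if key in OPTIONS and logprob > best.get(key, float("-inf")):
            best[key] = logprob

    if not best:
        return None, "none"
    return max(best, key=best.get), "logprob"
```

Returning the source of the answer (`"text"`, `"logprob"`, or `"none"`) alongside the letter makes it easy to report how many points the fallback actually recovered.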

7. A free pre-fine-tune check

The single cheapest predictor of fine-tune trouble is section-count. Count the output sections your training responses force, and compare to how the model naturally answers (about two: an answer and a confidence). If your template forces four or five sections, you are in collapse territory at scale. You can eyeball this before spending a cent.
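The eyeball check can also be scripted. The sketch below is a crude heuristic under stated assumptions: it treats a line that starts with a markdown heading or a short capitalised `Label:` as a section, and it reads "four or five sections" as a threshold of four. The regular expression and function names are hypothetical, and this is not the predictor described on this page.

```python
import re
from statistics import median
from typing import Iterable

# Assumption: a section starts with a markdown heading or a short "Label:" line.
SECTION_RE = re.compile(
    r"^[ \t]*(?:#{1,6}[ \t]+\S|[A-Z][A-Za-z \-]{0,30}:)", re.MULTILINE
)


def count_sections(response: str) -> int:
    """Count section-like lines; an unlabelled response counts as one section."""
    return max(1, len(SECTION_RE.findall(response)))


def median_sections(responses: Iterable[str]) -> float:
    """Median section count over a sample of responses."""
    return median(count_sections(r) for r in responses)


def in_collapse_territory(taught_sections: float) -> bool:
    """Rule of thumb from the text: four or more forced sections is risky at scale."""
    return taught_sections >= 4
```

Run `median_sections` once on a sample of your training responses and once on the model's own answers to the same prompts; a large gap between the two is the warning sign.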

We also built a lightweight template-compatibility predictor that samples your training prompts, queries the model on the same prompts, and scores how far your response style diverges from its natural one. In our validation, the section-count component carried almost all of the signal. Length and semantic similarity were nearly flat across candidates. So: if you do nothing else, count your sections.

8. Caveats and scope

Source: Paper 4C (Lund 2026, in preparation) · expertise-reversal anchor: Paper 1 §5.8.7