Installing reasoning pathways: fine-tune or prompt?
A practical playbook for installing a reasoning behaviour into an LLM without breaking it
The one-line finding: what decides whether a fine-tune succeeds is template-compatibility (whether the response style you are teaching fights the model's natural one), not how much data you have. A short-template intervention absorbed a 24× dose increase with general capability intact; a verbose-template one collapsed a model's general capability from 86% to 3% at the same data scale and with the same recipe.
Fine-tuning to install a cognitive pathway (calibration, premise-checking, verify-before-commit) is expensive and, more importantly, risky in a way that is invisible until you measure baseline capability. A fine-tune can look fine on its target task while having quietly destroyed the model's general reasoning. A failed 70B run costs roughly $5–15 in compute plus deployment overhead, and the damage is permanent until you retrain. Worse: the damage scales with model size. A bigger model is not a safer fine-tuning target for this failure mode: it has more capacity to react badly.
This page is the practical companion to Paper 4C (Lund 2026, in preparation). It gives you a decision rule, a design framework, a free pre-flight check, and the empirical tables behind them.
On this page
- 1. The decision rule
- 2. Template-compatibility decides the outcome
- 3. The inverse-U: bigger isn't safer
- 4. The race-start framework (three qualities)
- 5. Anti-patterns to avoid
- 6. The ICL deployment pattern
- 7. A free pre-fine-tune check
- 8. Caveats and scope
1. The decision rule
For most teams, most of the time, at modern scale, the default is: try in-context learning (ICL) first. Fine-tuning earns its place only when ICL underperforms on the specific task and the response template you need is compatible with the model's natural output style.
- Substrate ≥ ~70B and the pathway needs a verbose, multi-section response? Do not fine-tune. Use ICL. A verbose template plus scale equals reactance collapse.
- Substrate ≥ ~70B and the template is short (close to natural style)? Either works; ICL is still cheaper, safer, and reversible — try it first.
- Substrate ≤ ~8–13B? Fine-tuning generally works at this scale. Still run the free check below; prefer ICL if you want reversibility or fast iteration.
The asymmetry that drives all of this: a free smoke test plus a $0.05 ICL evaluation can save you a $5–15 fine-tune failure, and ICL has no failure mode that destroys baseline capability.
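The rule above can be written down as a small function. This is a minimal sketch, not part of the paper: the name `recommend`, the return labels, and the cut-off of four sections for "verbose" (borrowed from the section-count check in section 7) are illustrative assumptions, and the 13B to 70B band gets an explicit "no-guidance" result because the rule says nothing about it.

```python
# Approximate thresholds from the decision rule ("~70B", "~8-13B").
LARGE_PARAMS_B = 70
SMALL_PARAMS_B = 13
# Assumption: a template counts as "verbose" at four or more sections.
VERBOSE_SECTIONS = 4


def recommend(params_b: float, template_sections: int,
              icl_underperforms: bool = False) -> str:
    """Return a coarse recommendation label for one substrate/template pair."""
    # Default for every scale: try in-context learning first.
    if not icl_underperforms:
        return "icl"
    verbose = template_sections >= VERBOSE_SECTIONS
    if params_b >= LARGE_PARAMS_B:
        # Verbose template at scale is the collapse case: stay with ICL.
        return "icl" if verbose else "fine-tune-with-check"
    if params_b <= SMALL_PARAMS_B:
        # Fine-tuning generally works here, but still run the free check.
        return "fine-tune-with-check"
    # The rule gives no explicit guidance between ~13B and ~70B.
    return "no-guidance"


if __name__ == "__main__":
    print(recommend(70, 5, icl_underperforms=True))
    print(recommend(8, 5, icl_underperforms=True))
```

Keeping ICL as the unconditional first branch mirrors the cost asymmetry: the cheap, reversible option is tried before any fine-tune is considered.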
2. Template-compatibility decides the outcome
We ran two interventions through identical dose-response curves on Llama-3.3-70B, same LoRA configuration, differing only in the response template the training data teaches:
- PCHECK (premise-discrimination) teaches a short template, `ANSWER: … CONFIDENCE: …`: about two sections, barely different from how the model already answers.
- Forward-framing teaches a verbose template, `ANSWER: … Relevant facts: … Verification: … Conclusion: … CONFIDENCE: …`: about five sections.
Same data scale. Same substrate. Same recipe. Opposite outcomes (MMLU is the general-capability benchmark; it drops first when an intervention breaks baseline reasoning):
| Examples | PCHECK MMLU | Forward-framing MMLU | PCHECK GPQA | Forward-framing GPQA |
|---|---|---|---|---|
| 90 | 83% | 40% | 40% | 33% |
| 270 | 83% | 10% | 40% | 7% |
| 540 | 87% | 3% | 43% | 13% |
PCHECK at 540 examples: baseline intact, target task lifted. Forward-framing at the same 540 examples: the model is destroyed. To rule out "PCHECK 90 was a lucky sweet spot," we pushed PCHECK to 24× the baseline dose (2160 examples). It held a flat plateau the whole way: MMLU 83–87%, no collapse at any dose.
The reframe
"More data made it worse" is not a universal property of high-dose fine-tuning. It is a property of incompatible-template fine-tuning at high dose. When the template fits the model's natural style, dose is harmless. When it doesn't, dose is an accelerant. So the first question is never "how much data?" It is "does the response style I'm teaching fight the model's natural one?"
3. The inverse-U: bigger isn't safer
The same recipe that helps a mid-tier model can hurt a frontier one. We took the best free-signal recipe (a 70-example in-context PCHECK pool, plus logprob-recovery when the text answer doesn't parse) and ran it on GPQA Diamond at a robust sample size (n=198) across two model families:
| Substrate | Vanilla | + recipe | Change | Regime |
|---|---|---|---|---|
| Llama-3.3-70B | 25.8% | 33.3% | +7.6pp | mid-tier — helps |
| gpt-4o-mini | 30.3% | 34.8% | +4.5pp | mid-tier — helps |
| gpt-4o | 42.9% | 41.4% | −1.5pp | frontier — reversed |
Read top to bottom: as the substrate's baseline capability rises (25.8% → 30.3% → 42.9%), the recipe's benefit shrinks and finally turns negative. The frontier model is hurt by the same prompt that helps mid-tier ones: it already does premise-checking, so the demonstration is noise. This is the expertise-reversal effect (Kalyuga et al. 2003): instructional supports that help novices hurt experts, here reproduced across substrate capacity.
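The "read top to bottom" check can be reproduced from the table itself. The sketch below only re-derives the lift from the rounded percentages shown, so the Llama row comes out at 7.5pp rather than the reported +7.6pp (the table value was presumably computed from unrounded accuracies); the helper name `lift_by_baseline` is illustrative.

```python
# (substrate, vanilla accuracy %, accuracy % with the recipe), copied from the table.
ROWS = [
    ("Llama-3.3-70B", 25.8, 33.3),
    ("gpt-4o-mini", 30.3, 34.8),
    ("gpt-4o", 42.9, 41.4),
]


def lift_by_baseline(rows):
    """Return (substrate, lift in pp) pairs ordered by rising vanilla accuracy."""
    ordered = sorted(rows, key=lambda r: r[1])
    return [(name, round(recipe - vanilla, 1)) for name, vanilla, recipe in ordered]


lifts = lift_by_baseline(ROWS)
# True when every step up in baseline capability comes with a smaller lift.
shrinking = all(a[1] > b[1] for a, b in zip(lifts, lifts[1:]))

if __name__ == "__main__":
    for name, lift in lifts:
        print(f"{name}: {lift:+.1f}pp")
    print("benefit shrinks with baseline:", shrinking)
```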
This is not an isolated result. The companion paper on capacity (Paper 2D) measures the same inverted-U through a completely different signal (the effect of in-context framing on the model's internal token-level competition) and finds the same shape, which it calls "Yerkes-Dodson on frame-effectiveness." Two paradigms, one curve. See the substrate-graded expertise-reversal discussion for the broader pattern.
4. The race-start framework (three qualities)
The race-start metaphor (from the ICL-as-working-memory paper and the broader series) says that an installed pathway sets where the race starts and how broad the starting direction is. Our data show this is not one dimension but three, each of which must be calibrated to the substrate. PCHECK is the only intervention we tested that lands well on all three at once.
| Quality | Fails if too low | Fails if too high |
|---|---|---|
| Breadth | One narrow direction activates too little of a large model's route-space | Too many disparate directions overload the model; signal goes diffuse |
| Task-relevance | Generic priming teaches no transferable skill | Over-specialised prompt generalises narrowly |
| Template-compatibility | (no real failure here) | Verbose multi-section template fights natural style → reactance, baseline destruction |
Template-compatibility is the quality with the strongest direct evidence (the counterfactual above) and the one engineers most often get wrong, because a richly-structured template looks like more thorough training data. At scale it is poison.
A corollary: episode count is a multiplier of fit-quality, not an independent lever. With an aligned intervention, more examples help; with a misaligned one, more examples fail to help or actively harm. One well-aligned example teaches more than a hundred misaligned ones.
5. Anti-patterns to avoid
Each of these is a real failure we observed, not a hypothetical:
- Verbose multi-section response templates at scale. The headline failure: forward-framing's five-section template collapsed 70B general capability to 3%.
- A single narrow strategy at 70B+. Forward-framing alone scored below the untrained baseline on hard questions.
- Too-broad mixes. A five-strategy pool scored worse than the untrained model on every metric. Breadth has an upper bound. (This matches Paper 2D's additivity ceiling: combining frames never beats the best single frame.)
- Strong abstract instructions. Dictating a verbose multi-step protocol to the model via the system message degraded hard-question accuracy. Showing beats telling.
- Domain fine-tuning expecting cross-domain transfer. The brain analogy (learning music improves general cognition) does not transfer to current LLMs. Music fine-tuning degraded every benchmark. Domain fine-tuning is local adaptation, not global amplification.
6. The ICL deployment pattern
When ICL is the answer, build the pool to mirror the three winning qualities: task-aligned (the examples should be the task, not a meta-frame around it), mixed content at moderate breadth (one coherent frame with internal variety, not a grab-bag), and natural in style (don't smuggle a heavy template in through the examples).
In-context learning with a well-built 70-example pool matched or beat every fine-tune we tried, at roughly $0.05 per evaluation instead of $5–15 per fine-tune, with zero baseline damage and full per-query reversibility. At high shot counts the model's general capability actually rose above baseline: the opposite of same-scale fine-tuning.
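A pool built along those lines can be turned into a prompt with very little machinery. The sketch below is an assumption-laden illustration, not the paper's harness: the `Example` type, the `Question:` label, and the `build_prompt` helper are hypothetical, and only the short `ANSWER: … CONFIDENCE: …` shape is taken from the text.

```python
from typing import NamedTuple, Sequence


class Example(NamedTuple):
    """One worked demonstration in the pool (hypothetical structure)."""
    question: str
    answer: str
    confidence: str


def build_prompt(pool: Sequence[Example], question: str) -> str:
    """Render the pool as demonstrations, then leave the new question open."""
    shots = [
        # Demonstrations use the short, natural two-section style.
        f"Question: {ex.question}\nANSWER: {ex.answer} CONFIDENCE: {ex.confidence}"
        for ex in pool
    ]
    # The final block stops at "ANSWER:" so the model completes it.
    shots.append(f"Question: {question}\nANSWER:")
    return "\n\n".join(shots)


if __name__ == "__main__":
    pool = [
        Example("Which planet is closest to the Sun?", "Mercury", "high"),
        Example("Why is the Moon made of cheese?", "The premise is false", "high"),
    ]
    print(build_prompt(pool, "What is the capital of France?"))
```

Because the demonstrations carry the style, nothing in the prompt tells the model what to do, which is in line with the "showing beats telling" anti-pattern above.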
Logprob-recovery is a free companion signal. When the model's text answer can't be parsed cleanly, fall back to the most probable answer-option from the token logprobs. This recovers a few points of accuracy at zero extra inference cost. Always request logprobs so the fallback is available.
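The fallback described above might look like the following. This is a sketch under stated assumptions: it presumes a four-option multiple-choice task, and `option_logprobs` is a hypothetical dict mapping option letters to log-probabilities that you have already extracted from whatever logprob output your provider returns. No specific API is implied.

```python
import re
from typing import Mapping, Optional, Tuple

OPTIONS = ("A", "B", "C", "D")  # assumption: four-way multiple choice
# Match "ANSWER: B" or "ANSWER: (B)", but not "ANSWER: Based on ...".
ANSWER_RE = re.compile(r"ANSWER:\s*\(?([A-D])(?![A-Za-z])")


def pick_answer(text: str,
                option_logprobs: Mapping[str, float]) -> Tuple[Optional[str], str]:
    """Return (answer, source), where source says how the answer was obtained."""
    match = ANSWER_RE.search(text)
    if match:
        return match.group(1), "parsed"
    # Text did not parse: fall back to the most probable answer option.
    candidates = {o: lp for o, lp in option_logprobs.items() if o in OPTIONS}
    if not candidates:
        return None, "unrecovered"
    return max(candidates, key=candidates.get), "logprob"


if __name__ == "__main__":
    print(pick_answer("ANSWER: B CONFIDENCE: high", {}))
    print(pick_answer("Probably the second one.", {"A": -2.3, "B": -0.4}))
```

Returning the source alongside the answer lets you report how many points came from the fallback rather than from clean parses.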
7. A free pre-fine-tune check
The single cheapest predictor of fine-tune trouble is section-count. Count the output sections your training responses force, and compare to how the model naturally answers (about two: an answer and a confidence). If your template forces four or five sections, you are in collapse territory at scale. You can eyeball this before spending a cent.
We also built a lightweight template-compatibility predictor that samples your training prompts, queries the model on the same prompts, and scores how far your response style diverges from its natural one. In our validation, the section-count component carried almost all of the signal. Length and semantic similarity were nearly flat across candidates. So: if you do nothing else, count your sections.
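Counting sections is easy to automate. The sketch below is not the predictor described above; it is a minimal heuristic that treats any capitalised label followed by a colon as a section header, which fits the two templates in section 2 but will miscount free-form prose that happens to contain colons. The names `count_sections` and `section_gap` are illustrative.

```python
import re
from statistics import mean
from typing import Sequence

# Heuristic label: a capitalised word, up to two lowercase words, then a colon.
# Matches "ANSWER:", "Relevant facts:", "CONFIDENCE:" and similar.
LABEL_RE = re.compile(r"\b[A-Z][A-Za-z]*(?: [a-z]+){0,2}:")


def count_sections(response: str) -> int:
    """Count label-style section headers in one response."""
    return len(LABEL_RE.findall(response))


def section_gap(training: Sequence[str], natural: Sequence[str]) -> float:
    """Mean sections in training responses minus mean in the model's own answers."""
    return mean(map(count_sections, training)) - mean(map(count_sections, natural))


if __name__ == "__main__":
    short = "ANSWER: … CONFIDENCE: …"
    verbose = "ANSWER: … Relevant facts: … Verification: … Conclusion: … CONFIDENCE: …"
    print(count_sections(short), count_sections(verbose))
    print(section_gap([verbose], [short]))
```

A gap near zero means the training template sits close to the model's natural style; a gap of two or three sections corresponds to the four-or-five-section territory flagged above.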
8. Caveats and scope
- Sample size matters for the lift. Many per-condition tables use n=30, which shows the shape faithfully but inflates the magnitude. The 70B in-context GPQA lift is +19pp at n=30 but +7.6pp at the robust n=198. Trust the larger-n numbers for magnitude.
- Two model families tested (Llama and OpenAI). The frontier reversal was confirmed on gpt-4o, so the high-capacity end of the curve is not Llama-specific — but the fine-tuning reactance results are Llama-only so far.
- Premise-validation task family. The clean scaling of the short-template intervention may not generalise to maths, code, or multimodal pathways.
- From ongoing work. This is a practical summary of Paper 4C (in preparation); numbers may refine before publication.
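To see why the n=30 magnitudes deserve less trust, it helps to look at the sampling error of a single accuracy estimate. The sketch below uses the textbook normal-approximation (Wald) interval for a proportion; it is an illustration of the sample-size caveat, not the analysis used in the paper, and the interval on a difference between two accuracies would be wider still.

```python
import math


def ci_half_width_pp(accuracy: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% half-width, in percentage points, for an accuracy in [0, 1]."""
    standard_error = math.sqrt(accuracy * (1.0 - accuracy) / n)
    return 100.0 * z * standard_error


if __name__ == "__main__":
    baseline = 0.258  # Llama-3.3-70B vanilla GPQA accuracy from section 3
    for n in (30, 198):
        print(f"n={n}: +/- {ci_half_width_pp(baseline, n):.1f}pp")
```

The half-width shrinks with the square root of n, so going from 30 to 198 questions narrows the interval by a factor of about 2.6.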
Related pages
- Learning — the substrate-graded expertise-reversal effect that the inverse-U here is one instance of
- In-context learning as working memory, fine-tuning as long-term memory — the substrate-mechanistic account of why ICL and FT behave differently
- Cross-substrate phenomena — where human and LLM behaviour share a shape
Source: Paper 4C (Lund 2026, in preparation) · expertise-reversal anchor: Paper 1 §5.8.7