Reasoning interventions and their regime-dependence
Which technique helps your model — and when the same technique backfires
The thing most prompt guides miss: a reasoning technique is not simply "good" or "bad." The same technique helps or hurts depending on where your model sits on a capability curve (an inverse-U; the high end is the classic expertise-reversal effect). A frame that lifts a mid-tier model is noise to a frontier model and can mislead it. This page catalogs techniques by what they do and when they help.
Two organising findings from the friction-theory series (Papers 1–4C):
- Instruction / purpose framing is the active lever. Telling the model what to do with information moves its internal route-competition and is strongly regime-dependent — it can flip from help to harm as capability rises.
- Re-presenting data is architecture- and task-dependent, not inert. On instruction-tuned transformers doing self-describing tasks, chunking or reordering the same facts barely moves accuracy (the instruction, not the rearrangement, gives direction). But on a state-space model doing multi-step integration, chunking significantly improves accuracy, and it measurably shifts internal substrate-state even where accuracy is flat.
On this page
- 1. Instruction frames (the active levers)
- 2. Reasoning routines (match to the question)
- 3. What backfires (anti-patterns)
- 4. Match the routine to your question
1. Instruction frames — the active levers
Helps: mid-capacity models on multi-step tasks — it pre-allocates the right routes for the work ahead.
Backfires / does nothing: frontier models (already organised; a wrong purpose can mislead them).
Example:
"Soon you'll chain these facts to compute a derived value. [facts] Now: [question]"Helps: mid-tier models (Llama-8B and 70B) — both on false-premise rejection and on hard multiple-choice.
Backfires: frontier models (gpt-4o: −4pp) — they already discriminate premises, so the instruction is interference.
Example:
"If the question rests on a false premise, say so. Otherwise answer. ANSWER: ... CONFIDENCE: ..."This is the flagship case in Paper 4C: it works whether installed by prompting or fine-tuning at mid scale, and reverses at the frontier.
2. Reasoning routines — match to the question
Helps: distraction questions and close-calls — working through the steps suppresses the spurious path.
Hurts: genuinely ambiguous questions — the extra reasoning manufactures false certainty where the honest answer is "it depends."
Helps: insufficient-information questions — it catches hallucinations and prompts an honest "I can't know this."
Hurts: close-calls — second-guessing topples an already-uncertain but correct choice.
Helps: recovering answers the model knows but doesn't surface on the first pass.
Cost: a high false-positive rate on already-correct answers — it can talk the model out of a right answer. Use it to rescue, not to double-check everything.
Helps: substrate-specific — strong on some architectures, weak on others.
Does nothing: frontier models, where verification is already built into the first pass.
3. What backfires — anti-patterns
Backfires: at scale this collapses general capability — a Llama-70B fine-tune fell from 86% to 3% on a broad benchmark as the dose rose. A template that fights the model's natural output style gets pasted onto everything. (The effect on small models is still under test; don't generalise yet.)
Backfires: larger models are harmed — they register the conflict and it disrupts the answer (a reactance-like effect). Smaller models simply ignore it. The bigger the model, the more it costs.
Backfires: they don't add — you pay the full length-and-complexity cost of every frame without the synergy. One well-chosen frame beats three stacked.
Backfires: a small dose of abstention examples on their own can make the model abstain on almost everything. Calibration is not installed by teaching refusal in isolation.
Does nothing: on self-describing tasks the accuracy barely moves — instruction-tuned models already parse the structure. Direction comes from an instruction about what to do with the data, not from rearranging it.
4. Match the routine to your question
The reasoning routines above are not universally good; each fits a kind of question. A rough map:
| Question type | Use | Avoid |
|---|---|---|
| Distraction (irrelevant detail) | Chain-of-thought | — |
| Close-call (near-equal options) | Chain-of-thought | Self-critique (topples it) |
| Ambiguous (no single right answer) | State the ambiguity | Chain-of-thought (false certainty) |
| Insufficient information | Self-critique, premise-checking | — |
The bigger picture
These are the treatment side: the interventions. The other half, diagnosing where a given model sits on the curve so you can pick the right-typed intervention, is the ongoing research program behind the friction-theory series. The practical takeaway you can use today: don't ask "is this technique good?" Ask "where is my model, and does this technique help there?"
Related pages
- Installing pathways: fine-tune or prompt? — the FT-vs-ICL decision and why template-compatibility decides it
- Learning — the substrate-graded expertise-reversal effect these regime-flips are instances of
- Cross-substrate phenomena — where models and brains behave alike
Source: friction-theory series, Papers 1–4C (Lund 2026; Paper 4C in preparation). Catalog compiled from the series' empirical findings.