Should you fine-tune your model, or just prompt it?
Teaching a model new habits by retraining it often breaks it. Showing it examples works better and costs almost nothing.
Say you want a language model to do something specific: double-check its facts before answering, say when it isn't sure, or refuse questions built on a false premise. You have two main ways to install that habit:
- Fine-tuning — retrain the model on examples of the new behaviour. It changes the model permanently.
- In-context learning — just put a handful of examples in the prompt each time. The model copies the pattern, and nothing about it changes permanently.
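As a rough illustration of the second option, the sketch below assembles a prompt from a couple of worked examples plus the new question. The function name `build_few_shot_prompt`, the `Q:`/`A:` layout, and the example questions are all assumptions made for illustration; they are not the prompts used in the experiments.

```python
def build_few_shot_prompt(examples, question):
    """Join worked examples and a new question into one prompt string."""
    blocks = [f"Q: {q}\nA: {a}" for q, a in examples]
    # Leave the final answer open so the model continues the pattern.
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


# Hypothetical demonstrations: one confident answer, one false-premise refusal.
examples = [
    ("How many moons does Earth have?", "One. Confidence: high."),
    ("When did Napoleon land on the Moon?",
     "The premise is false: Napoleon never went to the Moon."),
]

prompt = build_few_shot_prompt(examples, "What is the capital of France?")
print(prompt)
```

The resulting string is sent to the model as ordinary input. Changing the habit later means editing the `examples` list, not retraining anything.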
The common assumption is that fine-tuning is the "serious" option and prompting is a stopgap. For the behaviours we tested, that assumption is backwards on large models.
The short version
Showing a big model examples in the prompt beat every version of retraining we tried: it was cheaper (a few cents instead of $5–15), it didn't damage the model, and it can be changed or dropped at any time. And what made retraining succeed or fail wasn't how much data we used. It was whether the answer format we taught looked like how the model already talks.
Why retraining can quietly break a model
Here is the result that surprised us. We retrained the same model two ways, changing only the shape of the answers in the training examples:
- A short format — just an answer and a confidence level. Close to how the model already replies.
- A long format — answer, then "relevant facts," then "verification," then "conclusion," then confidence. Five labelled sections.
We measured each version on a broad general-knowledge test (this is the canary: if retraining damages the model's core reasoning, this score falls first). Same data, same amount, same method. Opposite results:
| Training examples | Short format | Long format |
|---|---|---|
| 90 | 83% | 40% |
| 270 | 83% | 10% |
| 540 | 87% | 3% |
The short-format model stayed healthy. The long-format model fell apart, and the more we trained it, the worse it got, dropping from a healthy 86% to almost nothing. We then pushed the short format all the way to twenty-four times as much data. It never broke.
So "more data made it worse" is not a fact about heavy retraining in general. It is a fact about teaching a model an answer format that fights the way it naturally talks. When the format fits, more data is harmless. When it clashes, more data makes the clash worse. The model ends up forcing its odd new template onto everything, even questions where the template makes no sense.
Bigger models are easier to break, not harder
You might expect a more capable model to shrug this off. The opposite is true. The same prompt-based recipe that helped mid-sized models hurt the most capable one we tested:
| Model | Without the recipe | With it | Result |
|---|---|---|---|
| Llama-3.3-70B | 25.8% | 33.3% | helped |
| gpt-4o-mini | 30.3% | 34.8% | helped |
| gpt-4o | 42.9% | 41.4% | hurt |
The better the model already is, the smaller the benefit, until, for the strongest model, the recipe actively gets in the way. The capable model already does the careful thing on its own, so the extra coaching is just noise it has to work around. This is a known effect in human learning too: training wheels help a beginner and slow down an expert. It shows up in language models the same way. (We explain the human version on the learning page.)
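To make the shrinking benefit concrete, the short sketch below recomputes the change in percentage points from the table above. The scores are copied from the table; the helper name `recipe_effect` is only an illustrative choice.

```python
# Scores from the table: (without the recipe, with the recipe), in percent.
results = {
    "Llama-3.3-70B": (25.8, 33.3),
    "gpt-4o-mini": (30.3, 34.8),
    "gpt-4o": (42.9, 41.4),
}


def recipe_effect(without, with_recipe):
    """Return the change in percentage points, rounded to one decimal."""
    return round(with_recipe - without, 1)


effects = {model: recipe_effect(*scores) for model, scores in results.items()}

for model, delta in effects.items():
    verdict = "helped" if delta > 0 else "hurt"
    print(f"{model}: {delta:+.1f} points ({verdict})")
```

The gain falls from 7.5 points to 4.5 points and then turns into a 1.5-point loss as the starting score rises.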
What to actually do
- Reach for prompting first. Especially on large models, and especially when the behaviour needs a wordy, structured answer. Retraining there tends to backfire.
- If you must retrain, count the sections. Look at the answers in your training data. If they have many more labelled parts than the model's normal reply, expect trouble — you can spot this for free, before spending anything.
- Don't over-mix. A focused set of examples beats a grab-bag of many different styles. We found that mixing in too many strategies did worse than doing nothing.
- Don't expect one skill to spill over. Training a model on music did not make it generally smarter — it made it worse at everything. Today's models don't transfer a narrow skill into broad ability the way the brain analogy suggests.
The money side makes the choice easy. Prompting cost us a few cents per test and could never damage the model. Retraining cost $5–15 per attempt and, when it went wrong, the damage was permanent until we retrained again. A free check plus a few-cent prompt test can spare you a failed retraining run.
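The free "count the sections" check from the list above could be sketched roughly as follows. This is a minimal sketch under a loud assumption: a labelled section is taken to be any line that begins with a short label followed by a colon. The names `count_sections` and `format_gap` are hypothetical, and the sample answers are invented to mirror the short and long formats described earlier, not taken from the training data.

```python
import re

# Assumption: a labelled section is a line starting with a short label
# followed by a colon, such as "Verification: ...".
LABEL = re.compile(r"^\s*[A-Za-z][A-Za-z ]{0,30}:", re.MULTILINE)


def count_sections(answer):
    """Count lines in one answer that open with a label and a colon."""
    return len(LABEL.findall(answer))


def average_sections(texts):
    """Average number of labelled sections across a list of answers."""
    return sum(count_sections(t) for t in texts) / len(texts)


def format_gap(training_answers, normal_replies):
    """Average sections in training answers minus average in normal replies."""
    return average_sections(training_answers) - average_sections(normal_replies)


# Invented samples that mirror the two formats described in this article.
short_format = "Answer: Paris\nConfidence: high"
long_format = (
    "Answer: Paris\n"
    "Relevant facts: Paris is the capital of France.\n"
    "Verification: consistent with the facts above.\n"
    "Conclusion: Paris\n"
    "Confidence: high"
)
normal_reply = "Paris is the capital of France."

print(format_gap([short_format], [normal_reply]))
print(format_gap([long_format], [normal_reply]))
```

A larger gap means the training answers carry more labelled parts than the model's usual replies. How large a gap counts as risky is not something this sketch decides; it only makes the comparison visible before any money is spent.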
Related pages
- Learning — why training wheels help beginners and slow experts, in models and people
- Prompting vs retraining as memory — why the two work so differently
- Phenomena — where models and brains behave alike
Based on Paper 4C (in preparation). The numbers, code, and full method are in the technical version.