Should you fine-tune your model, or just prompt it?

Teaching a model new habits by retraining it often breaks it. Showing it examples works better and costs almost nothing.

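Say you want a language model to do something specific: double-check its facts before answering, say when it isn't sure, or refuse questions built on a false premise. You have two main ways to install that habit: fine-tune the model (retrain it on examples of the behaviour) or prompt it (show it examples alongside the question).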

The common assumption is that fine-tuning is the "serious" option and prompting is a stopgap. For the behaviours we tested, that assumption is backwards at large scale.

The short version

Showing a big model examples in the prompt beat every version of retraining we tried: it was cheaper (a few cents instead of $5–15), it didn't damage the model, and you can change your mind any time. And what made retraining succeed or fail wasn't how much data we used. It was whether the answer-format we taught looked like how the model already talks.

Why retraining can quietly break a model

Here is the result that surprised us. We retrained the same model two ways, changing only the shape of the answers in the training examples: one run used a short answer format, the other a long one.

We measured each version on a broad general-knowledge test (this is the canary: if retraining damages the model's core reasoning, this score falls first). Same data, same amount, same method. Opposite results:

Training examples   Short format   Long format
90                  83%            40%
270                 83%            10%
540                 87%            3%

The short-format model stayed healthy. The long-format model fell apart, and the more we trained it, the worse it got, dropping from a healthy 86% to almost nothing. We then pushed the short format all the way to twenty-four times as much data. It never broke.

So "more data made it worse" is not a fact about heavy retraining in general. It is a fact about teaching a model an answer-format that fights the way it naturally talks. When the format fits, more data is harmless. When it clashes, more data makes the clash worse. The model ends up forcing its odd new template onto everything, even questions where it makes no sense.
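To make the widening gap concrete, here is a small Python sketch that subtracts the long-format score from the short-format score at each training size. The scores are copied from the table above; the variable and function names are illustrative only and do not come from the technical version.

```python
# Canary scores (percent correct on the general-knowledge test),
# copied from the table above. Keys are the number of training examples.
short_format = {90: 83, 270: 83, 540: 87}
long_format = {90: 40, 270: 10, 540: 3}


def gap_by_size(healthy, damaged):
    """Return {training size: points by which `damaged` trails `healthy`}."""
    return {n: healthy[n] - damaged[n] for n in sorted(healthy)}


gaps = gap_by_size(short_format, long_format)
for n, gap in gaps.items():
    print(f"{n} examples: long format trails short format by {gap} points")
```

With these numbers the gap grows at every step, which is the "more data makes the clash worse" pattern described above.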

Bigger models are easier to break, not harder

You might expect a more capable model to shrug this off. The opposite is true. The same prompt-based recipe that helped mid-sized models hurt the most capable one we tested:

Model            Without the recipe   With it   Result
Llama-3.3-70B    25.8%                33.3%     helped
gpt-4o-mini      30.3%                34.8%     helped
gpt-4o           42.9%                41.4%     hurt

The better the model already is, the less the coaching helps, until, for the strongest model, it actively gets in the way. The capable model already does the careful thing on its own, so the extra coaching is just noise it has to work around. This is a known effect in human learning too: training wheels help a beginner and slow down an expert. It shows up in language models the same way. (We explain the human version on the learning page.)
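The shrinking benefit is easy to check by hand. The Python sketch below computes the change in percentage points for each model, using only the three rows from the table above; the function name and data layout are illustrative assumptions.

```python
# (model, score without the recipe, score with it), in percent,
# copied from the table above and ordered by baseline score.
results = [
    ("Llama-3.3-70B", 25.8, 33.3),
    ("gpt-4o-mini", 30.3, 34.8),
    ("gpt-4o", 42.9, 41.4),
]


def changes(rows):
    """Return (model, change in percentage points) for each row."""
    return [(model, round(after - before, 1)) for model, before, after in rows]


deltas = changes(results)
for model, delta in deltas:
    verdict = "helped" if delta > 0 else "hurt"
    print(f"{model}: {delta:+.1f} points ({verdict})")
```

Read top to bottom, the change falls as the starting score rises: +7.5, then +4.5, then -1.5 points.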

What to actually do

The money side makes the choice easy. Prompting cost us a few cents per test and could never damage the model. Retraining cost $5–15 per attempt and, when it went wrong, the damage was permanent until we retrained again. A free check plus a few-cent prompt test can save a failed retraining run.
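For readers who want to try the prompt route, here is a minimal Python sketch of how a few-shot prompt could be assembled. It is not the recipe used in our tests: the function name, the Q/A layout, the instruction text, and the two example pairs are all hypothetical, chosen only to illustrate the false-premise behaviour mentioned at the top.

```python
def build_few_shot_prompt(instruction, examples, question):
    """Assemble a plain-text prompt: instruction, worked examples, new question.

    `examples` is a list of (question, answer) pairs. The Q:/A: layout is an
    assumption for illustration, not a format taken from the paper.
    """
    parts = [instruction.strip(), ""]
    for example_question, example_answer in examples:
        parts.append(f"Q: {example_question}")
        parts.append(f"A: {example_answer}")
        parts.append("")
    parts.append(f"Q: {question}")
    parts.append("A:")
    return "\n".join(parts)


# Hypothetical examples showing the habit we want: flag a false premise.
examples = [
    ("Why is the Great Wall of China visible from the Moon?",
     "It is not visible to the naked eye from the Moon, so the question "
     "rests on a false premise."),
    ("In which year did Einstein win the Nobel Prize for relativity?",
     "He won the 1921 Nobel Prize in Physics, but for the photoelectric "
     "effect, not for relativity."),
]

prompt = build_few_shot_prompt(
    "If a question is built on a false premise, say so before answering.",
    examples,
    "Why do humans only use 10% of their brains?",
)
print(prompt)
```

Because the examples live in the prompt rather than in the model's weights, changing the behaviour means editing this list and sending the request again.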

Based on Paper 4C (in preparation). The numbers, code, and full method are in the technical version.