LLMs aren't calculators

All the things that would seem strange if you thought they were just fast spreadsheets

Too little or too much challenge, and performance drops; the middle is the sweet spot

We tend to treat a language model like a very fast calculator: give it text, it does the maths, you get text back. On that picture, more information should always help, more explanation should always teach better, and if the model says it is confident, it is probably right.

None of these things are true. Language models behave a lot more like people than like calculators. They get overloaded by too much information. They get confused by too much explanation. They get anchored by the first thing they say. They suffer reactance when you tell them what to do. They show the same inverted-U on challenge level that mice show. And their stated confidence and their actual accuracy come apart in specific, predictable ways.

The explanation is a feature language models share with brains: they resolve competing answers under limited bandwidth, settling on one answer at a time. Once that architecture is in place, a particular set of surprises follows on its own.

Things a calculator wouldn't do

Information overload

Ask a calculator to add 100 numbers, then hand it 50 more, and it does not become any less accurate. Language models do. Past a certain amount of context, accuracy drops as you add more material, even when the extra material is correct, on-topic, and well-written. The model does not run out of room. Instead, the extra material competes with the answer-relevant material for the same limited bandwidth. A model that knew the answer at 500 words can get it wrong at 5,000.

Accuracy rises with relevant context up to a useful range, then falls as additional material (even on-topic, correct material) competes with the answer-relevant content for the same finite resolution-bandwidth. The exact thresholds depend on the model and task; the shape is the universal pattern.
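One way to check this on your own model is to hold the question fixed and vary only the amount of on-topic filler. The sketch below is a minimal harness under stated assumptions: `ask_model` is a hypothetical callable you would supply (prompt text in, reply text out), the scoring is a crude substring match, and the key fact is always placed first.

```python
from typing import Callable, Dict, List, Tuple


def accuracy_by_filler_count(
    items: List[Tuple[str, str, str]],  # (key_fact, question, expected_answer)
    filler: List[str],                  # on-topic sentences that do not contain the answer
    filler_counts: List[int],           # how many filler sentences to add per condition
    ask_model: Callable[[str], str],    # hypothetical: prompt in, reply text out
) -> Dict[int, float]:
    """Return the fraction of items answered correctly at each filler count."""
    results: Dict[int, float] = {}
    for count in filler_counts:
        correct = 0
        for key_fact, question, expected in items:
            # The key fact always comes first here; a fuller test would vary its position.
            context = " ".join([key_fact] + filler[:count])
            prompt = f"{context}\n\nQuestion: {question}\nAnswer:"
            # Crude scoring: the expected answer appears somewhere in the reply.
            if expected.lower() in ask_model(prompt).lower():
                correct += 1
        results[count] = correct / len(items)
    return results
```

Plotting the returned fractions against the filler counts shows whether accuracy rises and then falls for your model and task. The harness only measures the curve; it does not assume its shape.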

The overexplanation effect

Take a worked example. Now make it more thorough: more steps, more commentary, more careful framing of what's being shown. You'd expect the learner to do better. Often they do worse. This is true for human students and it's true for language models being fine-tuned on the example. The reason is the same in both cases: the elaboration competes with the underlying principle for the room to learn it. A shorter, less complete example often teaches better.

Completeness and learnability are not the same property

It feels intuitive that the most informative message is also the most learnable one. It isn't. Completeness is a property of the sender: whether you packed everything in. Learnability is a property of the recipient: whether there is room enough to learn it. A perfectly complete message can be unlearnable; an incomplete message can teach beautifully. The two properties trade off, and the trade-off depends on the recipient, not the sender.

Anchoring — the first word shapes the rest

Whatever a language model says first colours everything it says afterwards. This isn't a bug; it's a property of any system that generates one token at a time, where each token influences the distribution from which the next is drawn. Tversky and Kahneman documented anchoring in humans in the 1970s, and that body of work was recognised in Kahneman's 2002 Nobel Prize in economics (Tversky had died in 1996). Language models show the same effect, and you can measure it directly: change only the first token the model settles on, and the final answer changes at meaningful rates.
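The mechanism can be seen in a toy autoregressive chain. This is a sketch, not a real model: the vocabulary and every probability in `TRANSITIONS` are made up, and each token depends only on the one before it. Forcing the first token and propagating the probabilities exactly shows how far the final answer moves.

```python
# Toy first-order autoregressive "model": each token's distribution depends
# only on the previous token. All probabilities are invented for illustration.
TRANSITIONS = {
    "Yes": {"because": 0.9, "but": 0.1},
    "No": {"because": 0.3, "but": 0.7},
    "because": {"ACCEPT": 0.8, "REJECT": 0.2},
    "but": {"ACCEPT": 0.3, "REJECT": 0.7},
}
FINAL_TOKENS = {"ACCEPT", "REJECT"}


def final_distribution(first_token):
    """Exact distribution over final tokens after forcing the first token."""
    frontier = {first_token: 1.0}
    finals = {}
    while frontier:
        next_frontier = {}
        for token, prob in frontier.items():
            if token in FINAL_TOKENS:
                finals[token] = finals.get(token, 0.0) + prob
                continue
            # Spread this token's probability mass over its successors.
            for nxt, p in TRANSITIONS[token].items():
                next_frontier[nxt] = next_frontier.get(nxt, 0.0) + prob * p
        frontier = next_frontier
    return finals


print(final_distribution("Yes"))  # ACCEPT about 0.75, REJECT about 0.25
print(final_distribution("No"))   # ACCEPT about 0.45, REJECT about 0.55
```

Nothing about the task changed between the two calls, only the opening token, yet the majority answer flips.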

Path-dependence — the route through training matters

Take the same training data. Show it to the model in two different orders. You get two different models, with measurable differences on the same held-out test (one kept out of training in both cases). This is hysteresis: the end state depends on the order, not just the content. It is also why humans who learn the same material in different orders often end up with different skills, even when the test is the same.
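Order-dependence does not need a large model to appear. The sketch below is a deliberately tiny stand-in, assuming a one-parameter learner trained by a single pass of stochastic gradient descent with a constant learning rate; the data, learning rate, and held-out point are all invented for illustration.

```python
def sgd_fit_mean(samples, learning_rate=0.5, start=0.0):
    """One pass of SGD on squared error (w - x)^2 / 2, one sample at a time."""
    w = start
    for x in samples:
        w -= learning_rate * (w - x)  # the gradient of (w - x)^2 / 2 is (w - x)
    return w


data = [0.0, 0.0, 1.0, 1.0]
w_forward = sgd_fit_mean(data)                    # sees the zeros first
w_reversed = sgd_fit_mean(list(reversed(data)))   # sees the ones first

held_out = 0.5  # the same test point for both learners
print(w_forward, (w_forward - held_out) ** 2)     # 0.75 0.0625
print(w_reversed, (w_reversed - held_out) ** 2)   # 0.1875 0.09765625
```

Both runs saw the same four numbers. The later samples pull harder on the final state than the earlier ones, so the two orders end in different places and score differently on the same test.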

Things a calculator wouldn't suffer from

Reactance — instructions are themselves routes

Tell a child "don't think of a pink elephant" and they immediately think of a pink elephant. The instruction itself activates the route it tries to prevent. That is not a quirk of children; it is a structural fact about any architecture that has to manage competing routes. Instructions don't just transparently transmit their intent. They add a route the system now has to handle.

The same effect shows up sharply in language models. The strongest documented case: when you demand that a model answer in a format that conflicts with how it was trained (say, "just yes or no" from a model trained to elaborate), accuracy can collapse from 70% to 48%. The model isn't defiant. It has been given a new route (the format instruction) that now competes with the original task-route, and the two routes disrupt each other. The output suffers for it.

Format-violation experiments (n=50 per condition × 3). When the instruction asks for a format that conflicts with the model's trained output style, accuracy collapses by 22 percentage points (70% to 48%). The model isn't refusing; the format instruction adds a competing route that interferes with the answer route.

RLHF-trained models show reactance more strongly than base models, because RLHF makes them more responsive to instructions in general, helpful and unhelpful alike. The very thing that makes them follow instructions is the thing that makes them vulnerable to reactance. There is no version of the architecture that takes instructions seriously and ignores instructions when they would hurt; both behaviours come from the same machinery.

The inverted U on challenge

The well-known Yerkes-Dodson curve also shows up in language models: performance is best at moderate challenge, and worse both when the task is too easy and when it's too hard. It was first seen in mice in 1908 and has since been found in slime moulds, worms, other mammals, and now language models. The reason this is universal, not just biological, is that it's the only performance curve possible for any system that resolves competing candidates under finite bandwidth. Easy tasks waste bandwidth; hard tasks overload it; the middle is the only place that works.

The inverted-U is mathematically required for any system that resolves competing candidates under finite resources. Observed in qubits (10⁻¹⁵ s), molecular kinetics (10⁻⁹ s), stochastic resonance (10⁻³ s), and Yerkes-Dodson on mice and humans (10³ s). Same shape, eighteen orders of magnitude apart.
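To see how two opposing pressures produce a peak in the middle, here is a toy curve. It is an assumed functional form chosen for illustration, not a model fitted to any of the systems above: engagement grows linearly with challenge, headroom decays exponentially, and `capacity` is a made-up bandwidth parameter.

```python
import math


def toy_performance(challenge, capacity=2.0):
    """Toy curve: engagement grows with challenge, headroom decays with it."""
    engagement = challenge                      # too-easy tasks leave bandwidth unused
    headroom = math.exp(-challenge / capacity)  # too-hard tasks overload it
    return engagement * headroom


challenges = [i / 10 for i in range(51)]        # 0.0, 0.1, ..., 5.0
best_challenge = max(challenges, key=toy_performance)
print(best_challenge)  # 2.0, i.e. the peak sits at the assumed capacity
```

Performance is zero at zero challenge, falls away again at high challenge, and peaks where the two terms balance. Changing `capacity` moves the peak but not the shape.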

Things that look like correctness but aren't

Confident-wrong

When a model is "sure" about its next word (its internal probabilities put nearly all the weight on one option), it is more likely to be right, but not always. A real fraction of model errors come out of states where the model was, by every internal measure, confident. This is the trap any confidence-based safety system runs into eventually: the system can only catch the errors it's uncertain about, and confident-wrong errors are by definition the ones it isn't uncertain about. Humans do the same thing; the technical term is "metacognitive failure".
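The blind spot is easy to show with a confidence gate. The sketch below assumes Shannon entropy of the next-token distribution as the confidence measure and a 0.5-bit threshold; both are illustrative choices, and the three distributions are made up.

```python
import math


def entropy_bits(probs):
    """Shannon entropy of a next-token distribution, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def flagged_names(cases, max_entropy_bits=0.5):
    """Names of the cases an entropy gate would send for review."""
    return {c["name"] for c in cases if entropy_bits(c["probs"]) > max_entropy_bits}


# Made-up distributions over three candidate tokens.
cases = [
    {"name": "confident-right", "probs": [0.97, 0.02, 0.01], "correct": True},
    {"name": "uncertain-wrong", "probs": [0.40, 0.35, 0.25], "correct": False},
    {"name": "confident-wrong", "probs": [0.97, 0.02, 0.01], "correct": False},
]

flagged = flagged_names(cases)
missed = [c["name"] for c in cases if not c["correct"] and c["name"] not in flagged]
print(flagged)  # {'uncertain-wrong'}
print(missed)   # ['confident-wrong']
```

The gate catches the error that came with high entropy and passes the one that did not. Tightening the threshold flags more confident-right answers long before it reaches the confident-wrong one, because the two have identical distributions.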

Expertise reversal

Worked examples help beginners and hurt experts. This has been known in educational psychology for decades (Sweller, Kalyuga). Llama-3.3-70B reproduces it cleanly: 73% accuracy with no examples, 50% with one example, 61% with three. More guidance, worse performance, then partial recovery as the "interference" from the example fades. A calculator has no analogue for this. A language model has a structural reason for it.

Why this isn't a coincidence

The reason language models look human in these specific ways is the same reason humans look like slime moulds in these specific ways: shared architecture, not shared biology. The biology is the contingent part: the implementation. The architecture is the necessary part: the friction.