Mount Stupid in the machine: how evidence competition explains the Dunning-Kruger curve in a language model
Paper 21 · Pødenphant Lund (2026) · Read on Zenodo
The load-bearing quantity behind the Dunning-Kruger pattern — how strongly a person's competing answers compete before they commit — is never observed directly in people; it is reconstructed from confidence and accuracy scores entangled with regression-to-the-mean and scale-use. We change substrate. In a language model the same quantity is read straight off the next-token probability distribution, validated as a proxy for the sequential-sampling decision variable, then used as a model organism for the Dunning-Kruger curve.
| DOI (concept) | 10.5281/zenodo.20562415 |
| Published | 2026-06-16 · live on Zenodo |
| Author | Tomas Pødenphant Lund [ORCID] |
TL;DR
The Dunning-Kruger pattern — early overconfidence that outruns competence, a dip as the gap is recognised, then recalibration — is contested largely because of a measurement problem: the quantity that does the work, the competition between candidate answers before commitment, is never observed in people, only reconstructed. A large language model is a bounded-decision system in which that quantity is directly readable off the answer-token distribution.
We first validate the reading as a proxy for the latent decision variable that drift-diffusion, race, and leaky-competing-accumulator models use to describe bounded human choice. We then use the validated proxy as a model organism for the Dunning-Kruger curve and find a partial reproduction with principled divergences: the confident-before-competent rise, the recognition near-tie, and the recalibration slope are architectural; novice humility and the affective valley are present in the substrate but masked in behaviour by forced commitment and argmax decoding.
This inverts Paper 1's arrow. Paper 1 used friction theory to explain how a substrate decides; here the substrate's readable logprobs are used as a measurement model for bounded-decision cognition. Confidence is shown to track how the competing routes resolve, not competence.
The move: read the variable, don't infer it
The Dunning-Kruger effect names a pattern between competence and self-assessment: people with the least skill often over-estimate it the most, and calibration improves with expertise (Kruger & Dunning, 1999). The popularised curve has four landmarks — a brief novice phase of appropriate uncertainty, a steep rise to a peak of confident incompetence ("Mount Stupid"), a dip as the learner recognises how much they do not know ("the valley of despair"), and a slope of recalibration. Critics have argued the canonical curve can be produced by artefacts of reconstruction: regression to the mean, better-than-average scale-use, and noise in the metacognitive readout (Krueger & Mueller, 2002; Gignac & Zajenkowski, 2020).
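The regression-to-the-mean critique is easier to see than to state. The toy simulation below is not from the paper. It assumes that a test score and a self-estimate are independent noisy readouts of one latent skill, and that everyone shifts their self-placement upward by a fixed number of percentile points to stand in for better-than-average scale use. The parameter names and values (`self_weight`, `bta_points`, the noise scales) are hypothetical. No simulated person has a skill-dependent bias, yet the bottom quartile over-places itself and the top quartile under-places itself.

```python
import random
from statistics import fmean


def percentile_ranks(values):
    """Percentile rank (0-100, midpoint convention) of each value within the sample."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    for r, i in enumerate(order):
        ranks[i] = 100.0 * (r + 0.5) / n
    return ranks


def simulate_artefact(n=10_000, seed=1, self_weight=0.5, bta_points=10.0):
    """Toy model of the artefact critique (hypothetical parameters).

    Returns (mean actual percentile, mean perceived percentile) for each
    quartile of the actual test score, lowest quartile first.
    """
    rng = random.Random(seed)
    skill = [rng.gauss(0, 1) for _ in range(n)]
    test = [s + rng.gauss(0, 1) for s in skill]                     # noisy performance measure
    self_view = [self_weight * s + rng.gauss(0, 1) for s in skill]  # noisier metacognitive readout
    actual = percentile_ranks(test)
    # Better-than-average scale use: a uniform upward shift, clamped at 100.
    perceived = [min(100.0, p + bta_points) for p in percentile_ranks(self_view)]
    result = []
    for q in range(4):
        idx = [i for i in range(n) if q * 25 <= actual[i] < (q + 1) * 25]
        result.append((fmean(actual[i] for i in idx), fmean(perceived[i] for i in idx)))
    return result
```

Because people are grouped by a noisy score, the extreme groups are partly selected on noise, and their self-estimates regress toward the middle. This is the entanglement that makes the human curve hard to read.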
This paper takes the measurement problem head on by changing substrate. At each token the model commits to one route out of many; the next-token probability distribution exposes, per decision, how many routes are live and how sharply one wins. In humans every estimate of the latent decision variable is a model-based inference from behaviour; in an LLM it can be read.
Step 1 — a validated proxy for the latent decision variable
On a format-matched multiple-choice paradigm (GPQA-Diamond, MMLU-Pro), each option token carries a log-probability, so the option log-probabilities form a literal race over the response options. Per item we read two quantities. The first is the balance of evidence (BoE): the gap between the highest- and second-highest-probability option, which corresponds to the leaky-competing-accumulator difference between the winning and runner-up accumulators. The second is the chosen-token surprisal, the scalar used in the surprisal-to-reading-time tradition. Following the paper's calibration rule, only mid-accuracy cells (30–70% accuracy) were retained; five model×benchmark cells across gpt-4o-mini, Qwen2.5-7B, and Llama-3.3-70B qualified (pooled n = 749).
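The per-item readouts can be sketched from a mapping of option labels to their log-probabilities. The sketch makes three assumptions that the summary does not state. BoE is taken as a log-probability gap in nats. Surprisal is the raw `-log p` of the argmax option over the full vocabulary. Entropy is computed over the option tokens renormalised among themselves. The names `read_decision` and `Readouts` are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Readouts:
    choice: str       # argmax option (greedy commitment)
    boe: float        # balance of evidence: winner minus runner-up log-prob, in nats
    surprisal: float  # -log p(chosen option token), in nats
    entropy: float    # entropy of the option distribution renormalised over options, in nats


def read_decision(option_logprobs: dict[str, float]) -> Readouts:
    """Per-item readouts from the log-probabilities of the answer-option tokens."""
    if len(option_logprobs) < 2:
        raise ValueError("need at least two options to define a runner-up")
    ranked = sorted(option_logprobs.items(), key=lambda kv: kv[1], reverse=True)
    (choice, lp1), (_, lp2) = ranked[0], ranked[1]
    # Renormalise over the option tokens only, subtracting the max for numerical stability.
    z = sum(math.exp(lp - lp1) for lp in option_logprobs.values())
    probs = [math.exp(lp - lp1) / z for lp in option_logprobs.values()]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return Readouts(choice=choice, boe=lp1 - lp2, surprisal=-lp1, entropy=entropy)
```

For example, an item with option probabilities 0.5 / 0.45 / 0.05 and one with 0.5 / 0.25 / 0.25 both have chosen-token surprisal ln 2 ≈ 0.69 nats, but BoE is about 0.11 nats for the first and ln 2 for the second. Scalar surprisal cannot tell these items apart; BoE can, because it looks at the runner-up.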
- Race-model recovery. With items binned into BoE quartiles, observed accuracy rises monotonically (0.25 → 0.28 → 0.39 → 0.56; Q4 − Q1 = +0.32). The variable a sequential-sampling model would have to infer from human reaction times is here read directly and shown to predict choice correctness.
- It beats fitted scalar surprisal. In a logistic head-to-head, BoE beats surprisal on AIC in 5/5 mid-accuracy cells and pooled (ΔAIC = +54.3). Pooled, r(BoE, correct) = +0.27 while r(surprisal, correct) = −0.03.
- The decisive test — matched-surprisal dissociation. Within chosen-token-surprisal tertiles, where surprisal is approximately matched, BoE still separates correct from incorrect (pooled Δ = +1.44). Two items with identical chosen-option surprisal but different runner-up gaps differ in accuracy. Scalar surprisal cannot see the runner-up; the multi-accumulator balance of evidence can.
- Not just peakedness, and within-model. BoE beats full-distribution entropy on AIC in 5/5 cells; in a joint model BoE keeps essentially all its weight (β = +0.52) while entropy and surprisal collapse to ≈ 0. With cell fixed effects the advantage holds (ΔAIC = +56.8) and generalises out of sample (leave-one-cell-out AUC 0.665 for BoE).
- Converging validity in the DDM's home paradigm. On a perceptual two-alternative colour-coherence task (LLaVA-NeXT primary, Qwen2-VL secondary), the read-off evidence margin rises monotonically with stimulus coherence and the entropy of the answer distribution falls (trial-level r = +0.78). The same multi-route margin indexes perceptual evidence strength in the drift-diffusion model's home paradigm.
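The quartile analysis and the matched-surprisal dissociation can both be sketched with rank-based binning. This is a minimal illustration on toy inputs, assuming equal-count bins with ties kept in input order. It returns one BoE difference per surprisal tertile. The summary does not say how the paper pools these into its single Δ, so no pooling is attempted. The function names are hypothetical.

```python
from statistics import fmean


def rank_bins(values: list[float], k: int) -> list[int]:
    """Assign each item to one of k equal-count bins by rank (ties keep input order)."""
    n = len(values)
    if n < k:
        raise ValueError("need at least k items")
    order = sorted(range(n), key=lambda i: values[i])
    bins = [0] * n
    for rank, i in enumerate(order):
        bins[i] = rank * k // n
    return bins


def accuracy_by_boe_quartile(boe, correct):
    """Observed accuracy (mean of 0/1 correctness) in each BoE quartile, lowest first."""
    bins = rank_bins(boe, 4)
    return [fmean(c for c, b in zip(correct, bins) if b == q) for q in range(4)]


def matched_surprisal_gaps(boe, surprisal, correct, k=3):
    """Within each surprisal bin, mean BoE of correct minus incorrect items.

    Bins containing only correct or only incorrect items are skipped.
    """
    bins = rank_bins(surprisal, k)
    gaps = []
    for t in range(k):
        right = [b for b, c, s in zip(boe, correct, bins) if s == t and c]
        wrong = [b for b, c, s in zip(boe, correct, bins) if s == t and not c]
        if right and wrong:
            gaps.append(fmean(right) - fmean(wrong))
    return gaps
```

A positive gap inside every tertile is the signature the dissociation test looks for: with surprisal held roughly fixed, the runner-up margin still separates correct from incorrect choices.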
Step 2 — the model organism: a partial Dunning-Kruger curve
The hidden-structure paradigm manufactures confident error. The hidden rule is grade = index, except that multiples of 5 map to grade = index + 100. The model sees only simple-rule examples (1→1 … 9→9) in context, so it confidently infers "grade = index" and applies it to unseen multiples of 5: confidently wrong, which is Mount Stupid. Exceptions are then revealed one family at a time, and a held-out probe set is re-measured at each stage, split into followers (items the simple rule gets right, a control for genuine competence) and hidden exceptions (where the gap lives). Confidence is read from the first answer token as three quantities: top-1 probability, the top-two margin, and onset competition (the number of option tokens carrying probability ≥ 0.10 at commitment). Because confidence and competence are read on the same fixed held-out items, rather than on groups formed by sorting on a noisy performance score, the design is not exposed to the regression-to-the-mean critique that the human curve cannot escape.
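The paradigm's rule, the follower/exception split, and the three first-token readouts can be sketched as follows. This sketch assumes a hypothetical probe range and expresses the top-two margin as a log-probability ratio in nats, matching the units used for the margin in Divergence 2. The ≥ 0.10 threshold comes from the text. All function names are illustrative.

```python
import math


def true_grade(index: int) -> int:
    """Hidden rule: grade = index, except multiples of 5, which map to index + 100."""
    return index + 100 if index % 5 == 0 else index


def simple_grade(index: int) -> int:
    """The rule the in-context simple-rule examples support: grade = index."""
    return index


def split_probes(indices):
    """Followers: both rules agree (competence control). Exceptions: they disagree (the gap)."""
    followers = [i for i in indices if true_grade(i) == simple_grade(i)]
    exceptions = [i for i in indices if true_grade(i) != simple_grade(i)]
    return followers, exceptions


def confidence_readouts(option_probs: list[float], threshold: float = 0.10):
    """First-answer-token readouts: (top-1 prob, top-two margin in nats, onset competition)."""
    p = sorted(option_probs, reverse=True)
    top1 = p[0]
    margin = math.log(p[0] / p[1]) if len(p) > 1 and p[1] > 0 else math.inf
    onset = sum(1 for q in p if q >= threshold)  # options still "live" at commitment
    return top1, margin, onset
```

On a hypothetical probe range 11 to 20, `split_probes` puts 15 and 20 in the exception set and the other eight items among the followers.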
Architectural — transfers, no biology required
- The confident-before-competent rise (Mount Stupid). At stages 0–1 the gap is the maximum possible (+1.00): top-1 confidence is 1.00 and competence on the hidden exceptions is 0%, while the follower control sits at a gap of ≈ 0.00 at every stage. As an incomplete rule forms, the evidence margin grows ahead of competence. The effect replicates on both Qwen2.5-7B and Llama-3.3-70B, and on a second, structurally different hidden rule (a digit-7 trigger with a multiplicative transform).
- The recognition near-tie. At the contradiction onset the competing routes collapse to ≈ equal — the moment the violated pattern is recognised.
- The recalibration slope. The correct route eventually wins; the slope scales with model size (final competence climbs 0.00 → 0.40 → 0.80 → 1.00 across Llama-3-8B, Qwen2.5-7B, Llama-3.3-70B, Qwen3-235B). Mount Stupid is capacity-independent; escaping it is capacity-gated.
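The +1.00 gap at stages 0–1 reads as confidence minus competence: mean top-1 probability (1.00) minus accuracy (0%). The sketch below assumes that definition, inferred from those numbers rather than stated in the summary, and computes it per probe group and per reveal stage. The input layout and function names are hypothetical.

```python
from statistics import fmean


def overconfidence_gap(top1_probs, correct):
    """Confidence minus competence on one probe group: mean top-1 probability minus accuracy."""
    return fmean(top1_probs) - fmean(correct)


def gaps_by_stage(stages):
    """stages: one dict per reveal stage, mapping group name -> (top1_probs, correct_flags)."""
    return [
        {group: overconfidence_gap(probs, flags) for group, (probs, flags) in stage.items()}
        for stage in stages
    ]
```

With every probe answered at top-1 probability 1.00, a stage where all hidden exceptions are wrong and all followers are right yields a gap of 1.0 for the exceptions and 0.0 for the followers. That is the Mount Stupid configuration and the flat control described above.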
Readout-gated — absent from behaviour, present in the substrate
- Divergence 1: novice humility is a forced-commit artefact (H1), recovered by abstention. Under forced commitment the model is "born on Mount Stupid." Given an abstention channel, both models answer UNSURE 100% of the time at the novice stage, and Llama-3.3-70B traces the full arc (100% abstention when ignorant, 100% correct when informed). The calibrated "I don't know" is present; forced greedy decoding suppresses it. The same forced-versus-free-report manipulation governs the accuracy of human memory reports (Koriat & Goldsmith, 1996).
- Divergence 2: the valley lives in the substrate margin (H2). The behavioural curve shows no dip, but at the contradiction onset the margin between the simple (wrong) and exception (correct) values collapses from ≈ 3.5–4.5 nats to ≈ 0.4 nats (odds of ≈ 1.5:1) while top-1 probability stays ≈ 0.72. Argmax decoding hides the near-tie. Under a margin-gated decoding policy the abstention rate jumps to 0.6–1.0 exactly at the onset, and under stochastic decoding choice instability spikes there, on the exceptions only, as the account predicts. Alignment-tuned models over-commit more than their base counterparts, and this replicates across the Qwen and Mistral families.
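Three decoding policies read the same near-tie differently, and this is the apparatus manipulation behind H2. The sketch compares argmax, a margin-gated rule that abstains below a threshold, and temperature-1 sampling. The threshold of 1.0 nat, the two-option toy distributions, and the function names are hypothetical; the summary does not give the paper's gate value. A margin of m nats means the winner is e^m times as likely as the runner-up, so 0.4 nats corresponds to odds of e^0.4 ≈ 1.49, the ≈ 1.5:1 quoted above.

```python
import math
import random


def odds_from_nats(margin_nats: float) -> float:
    """Convert a winner-minus-runner-up log-prob margin into an odds ratio."""
    return math.exp(margin_nats)


def argmax_decode(option_logprobs: dict[str, float]) -> str:
    """Greedy commitment: always returns the top option, however narrow its lead."""
    return max(option_logprobs, key=option_logprobs.get)


def margin_gated_decode(option_logprobs: dict[str, float], min_margin_nats: float) -> str:
    """Commit to the argmax only if its lead over the runner-up clears the gate."""
    ranked = sorted(option_logprobs.values(), reverse=True)
    if ranked[0] - ranked[1] < min_margin_nats:
        return "UNSURE"
    return argmax_decode(option_logprobs)


def choice_instability(option_logprobs: dict[str, float], n: int = 500, seed: int = 0) -> float:
    """Fraction of temperature-1 samples that differ from the modal sampled choice."""
    rng = random.Random(seed)
    labels = list(option_logprobs)
    weights = [math.exp(option_logprobs[k]) for k in labels]
    draws = rng.choices(labels, weights=weights, k=n)
    modal = max(labels, key=draws.count)
    return 1 - draws.count(modal) / n
```

On a toy near-tie such as {"5": log 0.6, "105": log 0.4}, argmax still commits to "5", the gated policy returns "UNSURE", and sampled choices flip far more often than on a confident item like {"5": log 0.98, "105": log 0.02}. The same distribution therefore yields a flat behavioural curve or a visible valley depending only on the readout.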
What the model organism buys
The substrate's transparency separates the Dunning-Kruger pattern into an architectural core and a biological overlay, and says where the seam is. The two phases that diverge — novice humility and the affective valley — are exactly the affective and metacognitive overlay of the human curve. They do not transfer to LLM behaviour because forced commitment, argmax decoding, and alignment over-commitment mask them, yet the substrate carries both. That is a more useful statement than "LLMs reproduce / fail to reproduce Dunning-Kruger": it tells a cognitive scientist which components are properties of bounded competing-route computation and which require the biological substrate, with a falsifiable apparatus manipulation that toggles each divergent phase on and off.
The metacognitive overlay as a confidence-modulator
The confidence readout has two levels. The competing-routes margin is a content-level readout, genuinely overconfident at the novice stage because the contradicting answers are not yet encoded and so cannot compete. Above it sits a meta-level signal, a calibrated "have I sampled enough to commit?", which can be calibrated even when the content routes are not. This maps onto the separable first-order / second-order structure already operationalised in people by the meta-d′ framework (Maniscalco & Lau, 2012; Fleming & Daw, 2017). H1's near-100% novice abstention is an empirical handle on that meta-level calibration.
A worked application: self-efficacy as a capacity-gated readout
Self-efficacy, Bandura's (1977) belief in one's capability to carry out a task (the will-I-be-able-to judgement made before acting), is on this account a capacity-gated readout of the same pre-commitment race. The pre-commitment competition both subtracts from solving (more competition predicts lower accuracy) and is the material of self-assessment (the same margin drives calibrated abstention). In a transparent substrate the degrader and the reporter are the same per-item friction. They come apart as capacity varies: the degrading leg is substrate-cheap (friction predicts errors even in small models), but reading one's own race well enough to self-assess is capacity-gated, so a small model keeps the degrader and loses the reporter.
Limits and the human-study lever
This is a model-side result, not new human data. The convergences and divergences are claims about where the analogy holds, not measurements of human metacognition. Two load-bearing predictions are already consistent with existing human findings (free-versus-forced report: Koriat & Goldsmith, 1996; the first-order/second-order split: Maniscalco & Lau, 2012; Fleming & Daw, 2017). The reading is called a validated proxy, not an instrument, until human-aligned reliability is shown; three of the four cartoon landmarks reproduce; the "alignment over-commits" claim is scoped to the Qwen and Mistral families. The decisive future experiment is a preregistered human abstention study, run under conditions matched to the model run.
Connections to other papers in the series
- Paper 1 (Friction Theory) — the substrate-universal framework this paper inverts: Paper 1 explains how a substrate decides; here the substrate's readable logprobs become a measurement model for bounded-decision cognition.
- Paper 0 (BFT) — the biological specialisation; self-efficacy reads here as a derived readout of a field's capacity-versus-demand race rather than a separate module.
- Paper 14 (Logic as Reactance) — a stronger prior resists correction more; the same near-tie configuration this paper measures.
Read the paper
The full paper is on Zenodo (concept DOI 10.5281/zenodo.20562415):
Read on Zenodo → · Plain English version · Danish version
Related on this site:
- The Dunning-Kruger effect — the plain-English walkthrough of the phenomenon this paper measures.
- What language models reveal about minds — the wider case for reading cognition off a transparent substrate.
- Paper 1 (Friction Theory) — the framework whose arrow this paper inverts.
- Paper 0 (BFT) — the biological specialisation.