Friction-Guided Inference

Paper 3 · Pødenphant Lund (2026d) · Read on Zenodo

I study language models to understand people.On SimpleQA, the open model Qwen3-235B answers correctly 41.6% of the time. With a small pipeline on top, that rises to 57.2%, and so past GPT-4o (38.0%) and GPT-4.1 (40.0%). The whole lift comes from one free signal that already sits in the model's output. No retraining, no external verifier. Calibration costs about 1.50 dollars per setup, and the same move lifted every model I tested by 12 to 21 percentage points on demanding tasks (a handful of model-benchmark pairs so far, so the range is encouraging rather than a universal law).

What this is about

A language model often knows the right answer and says something wrong anyway. You can catch it the moment it happens and do something about it, on any language model, for almost no money.

The core observation: language models often have the knowledge they need to answer correctly, but they "lock onto" the wrong answer anyway. Sometimes they are confidently right. Sometimes they are confidently wrong. And sometimes they hesitate visibly, and the hesitation is detectable in their output. I have found a signal that makes the hesitation measurable, and a small pipeline built around that signal makes almost any language model substantially better.

The signal

The signal is what I call Competing Routes (CR): the number of candidate tokens that were within reach at each position in the model's output.

When the model is sure of itself, only one token dominates the distribution, and CR ≈ 1. When the model is split between alternatives, several tokens have similar probability, and CR can be 3, 5, 10. CR is the operational name for "the model was considering several answers."

What matters: CR is free. Any OpenAI-compatible API returns per-token logprobs if you ask for them with logprobs=True. You can compute CR from those logprobs in two lines of Python. No retraining. No external verifier. No fine-tuning. The signal sits there in every model's output, and almost nobody uses it.

A concrete example. Imagine the model generating a one-word answer to "What is the capital of France?" The API returns the top-5 candidates with their log-probabilities:

candidate token	logprob	probability
Paris	−0.04	0.96
paris	−3.91	0.02
The	−4.61	0.01

Only one token is really in play → CR ≈ 1. Now imagine the same model generating a token in a hard reasoning step, and the top-5 looks like this: "yes" 0.32, "no" 0.28, "maybe" 0.21, "it" 0.14, "unclear" 0.05 → CR ≈ 4 (four candidates within a meaningful margin). That number, computed per token, is the whole signal.

Two mechanisms built on top

CR by itself is just a measurement. The paper develops two practical mechanisms that use it:

1. Strategy pipeline. When the model is unsure (high CR), give it another chance. Specifically: ask it to reconsider the question under slightly different conditions, for example step-by-step reasoning, pre-mortem checking, verification, or alternative framings. Different strategies help different models on different tasks. The paper shows how to calibrate the right strategy from 50–200 example questions, and the calibration costs about 1.50 dollars per model-benchmark pair.

2. Calibrated abstention. When the model is very unsure, let it say "I don't know." This builds on a long line of work on confidence calibration and selective prediction (Guo and colleagues among others), the idea that a system should know when to abstain. That sounds trivial, but it is not: standard models will commit to an answer even when CR is wildly high, because they are trained to. A small calibration step lets the model abstain on the 20% of questions where it is most likely to be confidently wrong, which removes a large part of the most harmful errors.

The whole pipeline. Strategy and abstention are complementary: strategy recovers the answer when the model is unsure but salvageable; abstention prevents confidently-wrong answers when the model is very unsure.

Pre-mortem is one of the strategies, and it comes from human decision-making

A concrete example. Pre-mortem (the technique where you imagine your project has already failed and ask yourself why) is a well-established technique in human decision-making (Gary Klein 2007). It works because it activates routes that ordinary forward-looking reasoning does not: you reason from a state of imagined failure rather than imagined success, and that brings other considerations to the surface.

Pre-mortem turns out to be one of the most effective strategies in the calibration suite. It is not because the model was trained on Klein's work, and it is not because it works on every task (it does not, and on simple factual retrieval it often hurts). It is because the same architectural property that makes pre-mortem useful in people, that it forces the substrate to evaluate from a different starting state, also exists in language models. The substrate is different, but the technique transfers.

That is the broader pattern: strategies that work on people often work on language models, and the ones that work for the same architectural reasons (not the same biological reasons) transfer most reliably. Step-by-step reasoning, verification, alternative framings, pre-mortem: all of them have a root in human cognition and all of them show up in the calibration suite. The framework predicts which ones should transfer and which should not, and calibration tells you, per model and per benchmark, which ones actually do.

Concrete results

The strategy pipeline alone gives +7.7 to +20.8 percentage points on four of five tested cells, average +11.8 pp. Combined with calibrated abstention on the four cells where both were measured: +12 to +21 pp.

Tested across four model architectures (two dense transformers, one mixture-of-experts, one Liquid Neural Network) and four benchmarks (MATH-500, SimpleQA, MMLU-Pro, GPQA Diamond):

Qwen2.5-7B on MATH-500: 45.8% → 66.5% (+20.8 pp, strategy alone)
Qwen3-235B on SimpleQA, combined pipeline: 41.6% → 57.2% (+15.6 pp). That beats GPT-4o (38.0%) and GPT-4.1 (40.0%) on the same benchmark. An open-source model lifted past frontier closed models by the friction-guided pipeline.
Qwen3-235B on MMLU-Pro STEM, combined: 55.2% → 67.4% (+12.1 pp)
LiquidAI LFM2 on MMLU-Pro STEM, combined: 33.8% → 55.0% (+21.3 pp)
GPT-oss-20B on GPQA Diamond, combined: 26.3% → 45.5% (+19.2 pp)

Vanilla versus friction-guided pipeline across 5 cells, 4 architectures (dense transformer, MoE, Liquid Neural Network), 4 benchmarks. Average lift +17.8 pp. The pipeline is architecture-agnostic; calibration costs ~1.50 dollars per setup.

These are not cherry-picked. Five tested cells, five non-trivial lifts. The friction-guided pipeline transfers across architectures (dense, MoE, LNN), across benchmarks (mathematics, factual retrieval, reasoning, hard science), and across model sizes (7B to 235B).

An honest limit: the friction ceiling

An important caveat. CR tells you when the model is unsure. It does not tell you which answer is correct on any individual question. The whole lift comes from:

The strategy itself: giving the model another chance under new conditions, where its second attempt can resolve the uncertainty differently
Abstention: preventing it from committing to confident errors on questions where it does not have a reliable answer

What CR cannot do is catch confidently-but-wrong errors, that is, cases where the model is sure of itself but still wrong. By definition CR does not flag those, because there is no friction to flag. I call it the friction ceiling: a structural upper limit on what any friction-based signal can achieve. Paper 2B diagnoses where confidently-wrong error comes from at the substrate level: gradient training compresses the calibrated distribution, so the model loses access to the alternatives that would have flagged the error.

The friction ceiling is a limit, not a defeat. The pipeline still gives a substantial lift on every benchmark we tested. It just does not catch everything, and the honest reporting matters.

What it costs to deploy

Calibration is an online procedure that runs on about two hours of API calls and costs roughly 1.50 dollars per model-benchmark setup. It is a one-off deployment health check, not a model-training run. The strategies it chooses from are standard prompting techniques, so nothing proprietary, all reproduced openly in the preprint.

The pipeline is architecture-agnostic. It works on any language model with a standard OpenAI-compatible API that returns logprobs. We have tested it on Qwen, Llama, Mistral, GPT-oss, and LiquidAI's LFM2 with consistent results.

Why it matters beyond the paper

Beyond the empirical lift, this paper does something quietly important: it shows that the substrate-level theory has practical traction. Friction Theory predicted that LLM logprob distributions should carry an exploitable signal because the same race architecture that produces friction in brains produces it in transformers. Paper 3 cashes in that prediction.

The lift on SimpleQA, where Qwen3-235B gets past GPT-4o and GPT-4.1, is also a small but real demonstration that open-source models, with the right substrate-aware tool, can match frontier closed models on specific benchmarks, without retraining, for the price of a cup of coffee.

Related papers

Paper 1 — Friction Theory — the theoretical foundation; CR is the operational handle
Paper 2B — explains the friction ceiling at the substrate level: why confidently-wrong error exists
The Findings page — friction-guided inference as one of the methodological innovations

The full technique is in the technical version: Paper 3 (English technical). All code, data, and calibration protocols are published with the preprint. The full paper is on Zenodo: DOI 10.5281/zenodo.20014121.