Friction-Guided Inference

Paper 3 · Pødenphant Lund (2026d) · Read on Zenodo

I study language models to understand people. On SimpleQA, the open model Qwen3-235B answers correctly 41.6% of the time. With a small pipeline on top, that rises to 57.2%, which puts it ahead of GPT-4o (38.0%) and GPT-4.1 (40.0%). The whole lift comes from one free signal that is already present in the model's output. No retraining, no external verifier. Calibration costs about 1.50 dollars per setup, and the same move lifts almost any model by 12 to 21 percentage points on demanding tasks.

What this is about

A language model often knows the right answer and says something wrong anyway. You can catch it the moment it happens and do something about it, on any language model, for almost no money.

The core observation: language models often have the knowledge they need to answer correctly, but they "lock onto" the wrong answer anyway. Sometimes they are confidently right. Sometimes they are confidently wrong. And sometimes they hesitate visibly, and the hesitation is detectable in their output. I have found a signal that makes the hesitation measurable, and a small pipeline built around that signal makes almost any language model substantially better.

The signal

The signal is what I call Competing Routes (CR): the number of candidate tokens that were within reach at each position in the model's output.

When the model is sure of itself, a single token dominates the distribution, and CR ≈ 1. When the model is split between alternatives, several tokens have similar probability, and CR can be 3, 5, or 10. CR is the operational name for "the model was considering several answers."

The crucial property: CR is free. Any OpenAI-compatible API returns per-token logprobs if you ask for them with logprobs=True. You can compute CR from those logprobs in two lines of Python. No retraining. No external verifier. No fine-tuning. The signal sits there in every model's output, and almost nobody uses it.
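As a minimal sketch of what such a computation might look like: the text above does not give an exact formula for CR, so this example assumes one plausible reading, namely counting the candidate tokens whose probability is at least a fixed fraction of the top token's probability. The function names and the `ratio` cutoff are hypothetical and not taken from the paper. With the OpenAI Python client, the per-position alternatives would typically come from a chat completion requested with `logprobs=True` and `top_logprobs=k`; here they are passed in as plain lists so the example runs without network access.

```python
import math

def competing_routes(top_logprobs, ratio=0.1):
    """Count candidates "within reach" at one token position.

    top_logprobs: natural-log probabilities of the candidate tokens.
    ratio: hypothetical cutoff (an assumption, not a value from the paper);
           a candidate counts if its probability is >= ratio * best probability.
    """
    best = max(top_logprobs)
    cutoff = best + math.log(ratio)  # log(ratio * p_best)
    return sum(1 for lp in top_logprobs if lp >= cutoff)

def cr_profile(positions, ratio=0.1):
    """Return the CR value for every position in the output."""
    return [competing_routes(p, ratio) for p in positions]

# Two illustrative positions: one confident, one split between alternatives.
positions = [
    [math.log(0.97), math.log(0.02), math.log(0.01)],
    [math.log(0.40), math.log(0.35), math.log(0.20), math.log(0.05)],
]
print(cr_profile(positions))  # [1, 4]
```

How the per-position values are summarised into one number per answer (mean, maximum, or something else) is a separate design choice that this sketch leaves open.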

Two mechanisms built on top

CR by itself is just a measurement. The paper develops two practical mechanisms that use it:

1. Strategy pipeline. When the model is unsure (high CR), give it another chance. Specifically: ask it to reconsider the question under slightly different conditions, for example step-by-step reasoning, pre-mortem checking, verification, or alternative framings. Different strategies help different models on different tasks. The paper shows how to calibrate the right strategy from 50–200 example questions, and the calibration costs about 1.50 dollars per model-benchmark pair.

2. Calibrated abstention. When the model is very unsure, let it say "I don't know." That sounds trivial, but it is not: standard models will commit to an answer even when CR is wildly high, because they are trained to. A small calibration step lets the model abstain on the 20% of questions where it is most likely to be confidently wrong, which removes a large part of the most harmful errors.
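The two calibration steps above can be illustrated with a small sketch. It assumes a labelled calibration set where each strategy has been run on every question, and a single CR score per question; both the data layout and the selection rules (highest calibration accuracy, top-fraction cutoff) are assumptions for illustration, not the paper's protocol.

```python
def pick_strategy(results):
    """Pick the strategy with the highest accuracy on the calibration set.

    results: {strategy_name: [True/False per calibration question]}
    """
    return max(results, key=lambda name: sum(results[name]) / len(results[name]))

def abstention_threshold(cr_scores, abstain_fraction=0.2):
    """Return the CR value at or above which a question is abstained on.

    The threshold is set so that roughly `abstain_fraction` of the
    calibration questions (those with the highest CR) fall at or above it.
    """
    ranked = sorted(cr_scores, reverse=True)
    n_abstain = round(len(ranked) * abstain_fraction)
    if n_abstain == 0:
        return float("inf")  # never abstain
    return ranked[n_abstain - 1]

# Hypothetical calibration data (invented for the example).
results = {
    "step_by_step": [True, False, False, True],
    "pre_mortem": [True, True, True, False],
    "verify": [False, False, True, True],
}
cr_scores = [1.0, 1.2, 1.1, 3.5, 2.0, 1.0, 6.0, 1.3, 1.4, 8.5]

print(pick_strategy(results))           # pre_mortem
print(abstention_threshold(cr_scores))  # 6.0
```

In practice the calibration set would hold the 50–200 questions mentioned above rather than a handful of invented values.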

[Figure: The friction-guided pipeline, one decision tree per question. The model answers and reports CR, which is compared against a threshold calibrated per model.
Low CR: accept the answer (the model is sure enough).
Medium CR: retry with a strategy (pre-mortem, step-by-step, verify...).
Very high CR: abstain ("I don't know").
Three branches, one signal. The calibration step (~$1.50 per model-benchmark) sets the thresholds.]
The whole pipeline. Strategy and abstention are complementary: strategy recovers the answer when the model is unsure but salvageable; abstention prevents confidently-wrong answers when the model is very unsure.
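The three branches of the decision tree can be sketched as a small routing function. The threshold values, the `ask` and `retry` callables, and the choice to accept the retried answer without re-checking CR are all assumptions made for the example; the calibrated thresholds themselves would come from the calibration step.

```python
def route(cr, retry_threshold, abstain_threshold):
    """Map a CR score to one of the three branches."""
    if cr >= abstain_threshold:
        return "abstain"
    if cr >= retry_threshold:
        return "retry"
    return "accept"

def friction_guided_answer(question, ask, retry, retry_threshold, abstain_threshold):
    """Answer one question with the accept / retry / abstain logic.

    ask(question)   -> (answer, cr)  first attempt plus its CR score
    retry(question) -> answer        second attempt under a chosen strategy
    Both callables are hypothetical stand-ins for real model calls.
    """
    answer, cr = ask(question)
    decision = route(cr, retry_threshold, abstain_threshold)
    if decision == "abstain":
        return "I don't know"
    if decision == "retry":
        return retry(question)
    return answer

# Stub model calls so the sketch runs offline.
def fake_ask(question):
    return {"easy": ("Paris", 1.0), "hard": ("1912", 3.0), "unknown": ("42", 9.0)}[question]

def fake_retry(question):
    return "1913"

for q in ["easy", "hard", "unknown"]:
    print(q, "->", friction_guided_answer(q, fake_ask, fake_retry, 2.0, 6.0))
```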

Pre-mortem is one of the strategies, and it comes from human decision-making

A concrete example. Pre-mortem (the technique where you imagine your project has already failed and ask yourself why) is a well-established technique in human decision-making (Gary Klein 2007). It works because it activates routes that ordinary forward-looking reasoning does not: you reason from a state of imagined failure rather than imagined success, and that brings other considerations to the surface.

Pre-mortem turns out to be one of the most effective strategies in the calibration suite. It is not because the model was trained on Klein's work, and it is not because it works on every task (it does not, and on simple factual retrieval it often hurts). It is because the same architectural property that makes pre-mortem useful in people, namely that it forces the substrate to evaluate from a different starting state, also exists in language models. The substrate is different, but the technique transfers.

That is the broader pattern: strategies that work on people often work on language models, and the ones that work for the same architectural reasons (not the same biological reasons) transfer most reliably. Step-by-step reasoning, verification, alternative framings, pre-mortem: all of them have a root in human cognition and all of them show up in the calibration suite. The framework predicts which ones should transfer and which should not, and calibration tells you, per model and per benchmark, which ones actually do.

Concrete results

The strategy pipeline alone gives +7.7 to +20.8 percentage points on four of five tested cells, average +11.8 pp. Combined with calibrated abstention on the four cells where both were measured: +12 to +21 pp.

Tested across four model architectures (two dense transformers, one mixture-of-experts, one Liquid Neural Network) and four benchmarks (MATH-500, SimpleQA, MMLU-Pro, GPQA Diamond):

[Figure: accuracy of the vanilla model versus the friction-guided pipeline. +12 to +21 pp lift on 5 of 5 tested cells.
Qwen2.5-7B · MATH: 45.8 → 66.5 (+20.8)
Qwen3-235B · SimpleQA: 41.6 → 57.2 (+15.6) ★
Qwen3-235B · MMLU-Pro: 55.2 → 67.4 (+12.1)
LiquidAI LFM2 · MMLU-Pro: 33.8 → 55.0 (+21.3)
GPT-oss-20B · GPQA-D: 26.3 → 45.5 (+19.2)
★ Qwen3-235B on SimpleQA beats GPT-4o (38.0%) and GPT-4.1 (40.0%).]
Vanilla versus friction-guided pipeline across 5 cells, 4 architectures (dense transformer, MoE, Liquid Neural Network), 4 benchmarks. Average lift +17.8 pp. The pipeline is architecture-agnostic; calibration costs ~1.50 dollars per setup.
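The average lift quoted in the caption can be checked from the five per-cell lifts reported in the figure. The snippet below is only that arithmetic; the dictionary keys are shorthand labels chosen for the example.

```python
# Per-cell lifts in percentage points, as reported in the figure.
lifts_pp = {
    "Qwen2.5-7B / MATH": 20.8,
    "Qwen3-235B / SimpleQA": 15.6,
    "Qwen3-235B / MMLU-Pro": 12.1,
    "LiquidAI LFM2 / MMLU-Pro": 21.3,
    "GPT-oss-20B / GPQA-D": 19.2,
}

average = sum(lifts_pp.values()) / len(lifts_pp)
print(round(average, 1))  # 17.8
```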

These are not cherry-picked. Five tested cells, five non-trivial lifts. The friction-guided pipeline transfers across architectures (dense, MoE, LNN), across benchmarks (mathematics, factual retrieval, reasoning, hard science), and across model sizes (7B to 235B).

An honest limit: the friction ceiling

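An important caveat. CR tells you when the model is unsure. It does not tell you which answer is correct on any individual question. The whole lift comes from acting on that uncertainty: retrying with a strategy when the model is unsure but salvageable, and abstaining when it is very unsure.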

What CR cannot do is catch confidently-but-wrong errors, that is, cases where the model is sure of itself but still wrong. By definition CR does not flag those, because there is no friction to flag. I call it the friction ceiling: a structural upper limit on what any friction-based signal can achieve. Paper 2B diagnoses where confidently-wrong error comes from at the substrate level: gradient training compresses the calibrated distribution, so the model loses access to the alternatives that would have flagged the error.
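One way to make the ceiling concrete, as a rough sketch rather than the paper's definition: on a labelled calibration set, count the wrong answers whose CR stays below the retry threshold. Those errors are invisible to any CR-based rule, so even if every flagged error were repaired, accuracy could not exceed one minus their share. The record layout and the threshold are assumptions for the example.

```python
def friction_ceiling(records, retry_threshold):
    """Best-case accuracy if every CR-flagged error were fixed.

    records: list of (cr, is_correct) pairs from a labelled calibration set.
    Errors with cr < retry_threshold are confidently wrong and never flagged.
    """
    invisible = sum(1 for cr, ok in records if not ok and cr < retry_threshold)
    return 1 - invisible / len(records)

# Hypothetical records: one confidently-wrong answer (CR 1.1) out of five.
records = [(1.0, True), (1.1, False), (4.0, False), (5.0, True), (1.0, True)]
print(friction_ceiling(records, retry_threshold=2.0))  # 0.8
```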

The friction ceiling is a limit, not a defeat. The pipeline still gives a substantial lift on every benchmark I tested. It just does not catch everything, and reporting that honestly matters.

What it costs to deploy

Calibration is an online procedure that takes about two hours of API calls and costs roughly 1.50 dollars per model-benchmark setup. It is a one-off deployment health check, not a model-training run. The strategies it chooses from are standard prompting techniques: nothing is proprietary, and everything is reproduced openly in the preprint.

The pipeline is architecture-agnostic. It works on any language model with a standard OpenAI-compatible API that returns logprobs. I have tested it on Qwen, Llama, Mistral, GPT-oss, and LiquidAI's LFM2 with consistent results.

Why it matters beyond the paper

Beyond the empirical lift, this paper does something quietly important: it shows that the substrate-level theory has practical traction. Friction Theory predicted that LLM logprob distributions should carry an exploitable signal because the same race architecture that produces friction in brains produces it in transformers. Paper 3 cashes in that prediction.

The lift on SimpleQA, where Qwen3-235B gets past GPT-4o and GPT-4.1, is also a small but real demonstration that open-source models, with the right substrate-aware tool, can match frontier closed models on specific benchmarks, without retraining, for the price of a cup of coffee.

Related papers

The full technique is in the technical version: Paper 3 (English technical). All code, data, and calibration protocols are published with the preprint. The full paper is on Zenodo: DOI 10.5281/zenodo.20014121.