Friction-Guided Inference: A Free Signal That Improves Any Large Language Model
Pødenphant Lund, T. (2026d) · Preprint · Live on Zenodo
A free signal from any logprob-returning API (Competing Routes from logprobs) plus calibrated strategy + abstention yields +12 to +21 percentage points on 5/5 tested cells across four architectures and four benchmarks (generality to untested settings is hypothesised, not established). On SimpleQA, the combined pipeline lifts Qwen3-235B past GPT-4o and GPT-4.1. Calibration cost: ~$1.50 per model-benchmark pair.
| DOI | 10.5281/zenodo.20014121 |
| Target venue | NeurIPS / ICML / TMLR |
| Status | Preprint live; submission package consolidated |
| Length | ~10,323 words |
| Author | Tomas Pødenphant Lund [ORCID] |
TL;DR
Large language models frequently possess the knowledge needed to answer correctly yet commit to the wrong response. The correct answer often appears in the top-k logprob distribution, assigned high probability but narrowly outranked by an incorrect alternative. We call this the commitment gap.
This paper presents friction-guided inference: a method using the model's own logprob distribution (available at zero cost from any OpenAI-compatible API) to (1) select calibrated correction strategies and (2) identify questions where the model should abstain rather than commit.
The key signal is competing routes (CR): the count of high-probability alternative tokens per position. CR detects when a model is uncertain (population-level AUC 0.53-0.68) and, through calibration, identifies which correction strategies help. An ablation reveals a sharp limit: CR does not reliably determine which answer is correct at the individual-question level. The lift comes from the strategy itself (asking the model to reconsider under different conditions) combined with CR-guided uncertainty thresholds that decide when to answer at all.
Headline result: a calibrated strategy pipeline alone produces +7.7 to +20.8 pp on four of five tested cells (mean +11.8 pp). When combined with CR-guided abstention on four cells where both were measured, the combined pipeline reaches +12 to +21 pp. The two mechanisms are complementary (strategy recovers commitment gaps, abstention prevents confident-wrong commits) and combine super-additively.
Strategy-only results (across two dense transformers, one mixture-of-experts, and one Liquid Neural Network; benchmarks: MATH-500, SimpleQA, MMLU-Pro, GPQA Diamond):
- Qwen2.5-7B · MATH-500 L4–5: 45.8% → 66.5% (+20.8 pp, CI [+15.1, +26.4])
- Qwen3-235B · SimpleQA: 41.1% → 51.8% (+10.6 pp, CI [+9.1, +12.1])
- Qwen3-235B · MMLU-Pro STEM: 55.2% → 62.9% (+7.7 pp, CI [+6.4, +9.0])
- LiquidAI LFM2 · MMLU-Pro STEM: 33.8% → 41.9% (+8.1 pp, CI [+7.1, +9.2])
- GPT-oss-20B · GPQA Diamond: 27.0% → 30.4% (+3.4 pp, CI [+0.0, +6.8] — positive trend, n=148)
Combined pipeline (strategy + 20% calibrated abstention), on the four cells where both components were measured:
- Qwen3-235B · SimpleQA: 41.6% → 57.2% (+15.6 pp) — surpasses GPT-4o (38.0%) and GPT-4.1 (40.0%)
- Qwen3-235B · MMLU-Pro STEM: 55.2% → 67.4% (+12.1 pp)
- LiquidAI LFM2 · MMLU-Pro STEM: 33.8% → 55.0% (+21.3 pp)
- GPT-oss-20B · GPQA Diamond: 26.3% → 45.5% (+19.2 pp)
Strategy-only mean across four statistically significant cells: +11.8 pp. Abstention-only (at 20% abstention) adds +6.5 to +14.1 pp success-rate improvement, at zero additional inference cost.
The method requires per-model calibration, but the pipeline code and signal are architecture-agnostic. Calibration is an online procedure that runs in approximately two hours of API calls at a cost of roughly $1.50 per cell. It is closer in cost and character to a deployment health check than to model training, and bears no resemblance to a hyperparameter search over model weights.
Calibration protocol summary
| Step | What it does | Sample size | Approx. time / cost |
|---|---|---|---|
| 1. Fixed multi-round baseline | Run all strategies N≥5 times each on full question set with logprobs=True; collect per-token CR + answer + ground truth | N≥5 rounds × full benchmark | ~30-60 min · ~$0.50 |
| 2. Analyse calibration data | Spike patterns per outcome group; retrievable vs epistemic classification; per-cell best-strategy; oracle plateau | Uses Step 1 data | ~15 min local Python · $0 |
| 3. Design adaptive pipeline | Set cell queues, stop criteria, and abstention threshold from Step 2 findings | Uses Step 2 output | ~15 min |
| 4. Validate adaptive run | Run adaptive pipeline on held-out subset to confirm strategy-best transfer + lift | ~50-100 questions | ~15-30 min · ~$0.50 |
| Total per model-benchmark pair | ~2 h · ~$1.50 | ||
All code, data, calibration protocols, and a 23-rule quality-assurance framework are released with the preprint.
Why it works
The commitment gap is a substrate-level phenomenon, the friction-ceiling pattern from Paper 1 §9.1b: retrieval succeeds but commit fails. The correct answer is statistically accessible, but the model commits to a marginally more probable wrong one. This is statistical structure, not arbitrary noise; it is the operational signature of race-resolution under bounded resources. A commitment gap can be closed by re-prompting the model under different conditions, which causes resampling and sometimes lands on the correct alternative. A genuine knowledge gap, where the correct answer has negligible probability, cannot.
The CR signal is free: it requires only logprobs=True in the API call. It needs no additional model calls, no external verifier, no labelled training data. Because CR measures a structural property of probabilistic token generation, it has so far transferred across the tested architectures, model families, and benchmarks; broader generality is hypothesised but not yet established outside the present panel.
Companion papers
- Paper 1 (Friction Theory) — the substrate theory the friction-ceiling concept rests on (§9.1b)
- Paper 2 (Capacity Scaling) — the encoding-side complement; documents the same retrieval-versus-derivation distinction