Friction-Guided Inference: A Free Signal That Improves Any Large Language Model

Pødenphant Lund, T. (2026d) · Preprint · Live on Zenodo

A free signal from any logprob-returning API (Competing Routes from logprobs) plus calibrated strategy + abstention yields +12 to +21 percentage points on 5/5 tested cells across four architectures and four benchmarks (generality to untested settings is hypothesised, not established). On SimpleQA, the combined pipeline lifts Qwen3-235B past GPT-4o and GPT-4.1. Calibration cost: ~$1.50 per model-benchmark pair.

DOI10.5281/zenodo.20014121
Target venueNeurIPS / ICML / TMLR
StatusPreprint live; submission package consolidated
Length~10,323 words
AuthorTomas Pødenphant Lund [ORCID]

TL;DR

Large language models frequently possess the knowledge needed to answer correctly yet commit to the wrong response. The correct answer often appears in the top-k logprob distribution, assigned high probability but narrowly outranked by an incorrect alternative. We call this the commitment gap.

This paper presents friction-guided inference: a method using the model's own logprob distribution (available at zero cost from any OpenAI-compatible API) to (1) select calibrated correction strategies and (2) identify questions where the model should abstain rather than commit.

The key signal is competing routes (CR): the count of high-probability alternative tokens per position. CR detects when a model is uncertain (population-level AUC 0.53-0.68) and, through calibration, identifies which correction strategies help. An ablation reveals a sharp limit: CR does not reliably determine which answer is correct at the individual-question level. The lift comes from the strategy itself (asking the model to reconsider under different conditions) combined with CR-guided uncertainty thresholds that decide when to answer at all.

Headline result: a calibrated strategy pipeline alone produces +7.7 to +20.8 pp on four of five tested cells (mean +11.8 pp). When combined with CR-guided abstention on four cells where both were measured, the combined pipeline reaches +12 to +21 pp. The two mechanisms are complementary (strategy recovers commitment gaps, abstention prevents confident-wrong commits) and combine super-additively.

Strategy-only results (across two dense transformers, one mixture-of-experts, and one Liquid Neural Network; benchmarks: MATH-500, SimpleQA, MMLU-Pro, GPQA Diamond):

Combined pipeline (strategy + 20% calibrated abstention), on the four cells where both components were measured:

Strategy-only mean across four statistically significant cells: +11.8 pp. Abstention-only (at 20% abstention) adds +6.5 to +14.1 pp success-rate improvement, at zero additional inference cost.

The method requires per-model calibration, but the pipeline code and signal are architecture-agnostic. Calibration is an online procedure that runs in approximately two hours of API calls at a cost of roughly $1.50 per cell. It is closer in cost and character to a deployment health check than to model training, and bears no resemblance to a hyperparameter search over model weights.

Calibration protocol summary

Step What it does Sample size Approx. time / cost
1. Fixed multi-round baselineRun all strategies N≥5 times each on full question set with logprobs=True; collect per-token CR + answer + ground truthN≥5 rounds × full benchmark~30-60 min · ~$0.50
2. Analyse calibration dataSpike patterns per outcome group; retrievable vs epistemic classification; per-cell best-strategy; oracle plateauUses Step 1 data~15 min local Python · $0
3. Design adaptive pipelineSet cell queues, stop criteria, and abstention threshold from Step 2 findingsUses Step 2 output~15 min
4. Validate adaptive runRun adaptive pipeline on held-out subset to confirm strategy-best transfer + lift~50-100 questions~15-30 min · ~$0.50
Total per model-benchmark pair~2 h · ~$1.50

All code, data, calibration protocols, and a 23-rule quality-assurance framework are released with the preprint.

Why it works

The commitment gap is a substrate-level phenomenon, the friction-ceiling pattern from Paper 1 §9.1b: retrieval succeeds but commit fails. The correct answer is statistically accessible, but the model commits to a marginally more probable wrong one. This is statistical structure, not arbitrary noise; it is the operational signature of race-resolution under bounded resources. A commitment gap can be closed by re-prompting the model under different conditions, which causes resampling and sometimes lands on the correct alternative. A genuine knowledge gap, where the correct answer has negligible probability, cannot.

The CR signal is free: it requires only logprobs=True in the API call. It needs no additional model calls, no external verifier, no labelled training data. Because CR measures a structural property of probabilistic token generation, it has so far transferred across the tested architectures, model families, and benchmarks; broader generality is hypothesised but not yet established outside the present panel.

Companion papers

Cite

Pødenphant Lund, T. (2026d). Friction-Guided Inference: A Free Signal That Improves Any Large Language Model [Preprint]. Zenodo. https://doi.org/10.5281/zenodo.20014121