Friction as the Cost of Probabilistic Computation
Paper 1 · Pødenphant Lund (2026b) · Read on Zenodo
A slime mould has no brain, no nerves, not a single cell that looks like a brain cell. Saigusa and colleagues showed in 2008 that it can still learn to expect something: expose it to cold at regular intervals, and it starts to slow down before the cold arrives, as if it were counting on it. The same underlying pattern, that choosing between options costs something, turns up in slime mould, in brains, and in language models. This paper builds the formal account and tests it on 15 different language models. Seven matching signatures show up across all of them.
What it is about
The cost of choosing is not biological. It is mathematical. Resolving competing candidates costs something in any system that picks between alternatives under finite resources, not just in brains. That makes behavioural friction a special case of something more general. Behavioural Friction Theory was about biological systems; here it is lifted into a broader framework, Friction Theory (FT), where BFT is the special case: BFT ⊂ FT.
Why bother generalising? Because the principle does not depend on the substrate. It holds for neural networks. It holds for chemical kinetics. It may hold all the way down to quantum measurement (that is what Paper 10 investigates). If you have a race architecture, meaning several candidates competing until one is selected, you have friction. If you have friction, you have a measurable cost. And from that cost a great many predictions follow.
The formal foundation
Friction is formally connected to thermodynamic free energy through Ortega & Braun's (2013) bounded-rational decision-making framework. This is not an analogy or a metaphor. It is the same mathematics as statistical mechanics.
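For orientation, the variational principle usually associated with Ortega & Braun's framework can be sketched as below. The notation is chosen here for illustration and is not taken from the paper summarised on this page: p_0 is a prior (default) policy, U a utility, and β an inverse temperature that prices deviation from the prior.

```latex
F[p] \;=\; \sum_x p(x)\,U(x) \;-\; \frac{1}{\beta} \sum_x p(x)\,\log\frac{p(x)}{p_0(x)},
\qquad
p^{*}(x) \;=\; \frac{p_0(x)\,e^{\beta U(x)}}{\sum_{x'} p_0(x')\,e^{\beta U(x')}}
```

The maximiser p* has the Boltzmann form. With a uniform prior, logits as utilities, and β = 1/T, it reduces to the temperature-scaled softmax.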
For language models this connection is especially precise. The softmax function in a transformer's output layer is not "inspired by" the Boltzmann distribution. It is the Boltzmann distribution, with each logit playing the role of a negative energy. The temperature parameter in sampling is not "similar to" temperature in physics. It enters the formula in exactly the same place, dividing the energies before exponentiation. Token choice in auto-regressive language models is bounded-rational decision-making in Ortega & Braun's sense, exactly. The mathematical inheritance is direct.
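The identity can be checked numerically. The sketch below is a minimal illustration, not code from the paper: the function names, the example logits, and the temperature value are all made up here. It computes a temperature-scaled softmax and a Boltzmann distribution over energies E_i = -z_i, then compares the two.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, with max-subtraction for numerical stability."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def boltzmann(energies, kT=1.0):
    """Boltzmann distribution: p_i proportional to exp(-E_i / kT)."""
    weights = [math.exp(-e / kT) for e in energies]
    partition = sum(weights)  # the partition function Z
    return [w / partition for w in weights]

# Illustrative values only (assumptions, not data from the paper).
logits = [2.0, 1.0, 0.1, -1.5]
temperature = 0.7

p_softmax = softmax(logits, temperature)
# Treat each logit as a negative energy: E_i = -z_i.
p_boltzmann = boltzmann([-z for z in logits], kT=temperature)

# Largest absolute difference between the two distributions.
max_gap = max(abs(a - b) for a, b in zip(p_softmax, p_boltzmann))
print(max_gap)
```

The two functions differ only in the sign convention and the stability trick, which is the point: the distributions agree up to floating-point rounding.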
That gives us a measurable quantity: Competing Routes (CR). CR counts how many candidate tokens were within reach at each position in the model's output. High CR means the model was weighing many alternatives. Low CR means the model was committed to one. CR comes for free from any language-model API that returns token log-probabilities (for example, by requesting logprobs=True). It correlates with model errors. It changes systematically across architectures. It is the operational handle that makes the whole framework empirically testable.
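This page does not give the exact rule for when a token counts as "within reach", so the following is only a sketch under an assumed definition: a candidate counts if its probability is at least a fixed fraction (`ratio`, a hypothetical parameter) of the top candidate's probability. The function name `competing_routes` and the example log-probabilities are likewise invented for illustration. No real API is called.

```python
import math

def competing_routes(top_logprobs, ratio=0.1):
    """Count candidates whose probability is at least `ratio` times the top one's.

    `top_logprobs` maps candidate token -> log-probability at one output position.
    The `ratio` threshold is an assumption of this sketch, not the paper's definition.
    """
    best = max(top_logprobs.values())
    cutoff = best + math.log(ratio)  # same comparison, done in log space
    return sum(1 for lp in top_logprobs.values() if lp >= cutoff)

# Made-up per-position candidates, shaped like a typical top-k logprobs response.
positions = [
    {"Paris": math.log(0.97), "Lyon": math.log(0.02), "Nice": math.log(0.01)},
    {"blue": math.log(0.40), "green": math.log(0.35),
     "grey": math.log(0.20), "red": math.log(0.05)},
]

cr_per_position = [competing_routes(p) for p in positions]
print(cr_per_position)  # [1, 4]
```

In the first position the model is committed to one token (CR = 1). In the second, four candidates clear the threshold (CR = 4). Other definitions, such as the exponential of the entropy, would fit the same description.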
Empirical test: 15 architectures, seven signatures
The theory has been tested empirically on 15 different language-model architectures ranging from 0.5B to 405B parameters: dense transformers, mixture-of-experts models, state-space models, and liquid neural networks, including both base and instruction-tuned models. Seven cross-architecture signatures were found:
- 1/e secretary-problem optimum: base language models naturally converge on roughly 1/e ≈ 36.8% sample-then-commit timing on iterative tasks. Not because they were trained on the secretary problem; because 36.8% is the friction-optimal exploration rate for any race-architecture system with a finite horizon. RLHF then pushes this number around.
- Parse-vs-generate phase decomposition: the friction signal decomposes into a "parse" phase (the model reads the question) and a "generate" phase (the model produces the answer). The two phases scale differently with task type and model size.
- Constructive vs destructive friction: sometimes friction signals genuine uncertainty that more thinking can resolve (constructive); sometimes it signals that the model is fundamentally confused and that more thinking will just commit to the wrong answer faster (destructive). The distinction is empirically detectable.
- Friction profiles as cognitive fingerprints: each architecture has a characteristic friction pattern across task types. Two models can score identically on a benchmark yet have completely different friction fingerprints, and that difference predicts which kinds of interventions will help them.
- Mode-shift entry and exit costs: switching from one reasoning mode to another costs friction. Cohen's d = 0.83–0.88 on instruction-tuned models; zero or reversed on matched base models. This is a sharp finding: the mode-shift cost is an RLHF artefact, not a substrate property.
- Reactance as thermodynamic hysteresis: instruction-tuned models show reactance (instructions can backfire) that tracks the intensity of their RLHF training. The more aligned the model is, the more it pushes back against instructions in measurable ways.
- Trailing-task forgetting under load: the strongest cross-model effect, Cohen's d = 1.2. When one task comes after another high-load task, performance on the second task degrades in a precisely predicted way.
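The 1/e figure in the first signature comes from the classical secretary problem, and its origin can be reproduced in a few lines. The sketch below illustrates only the textbook result, not the paper's measurement on language models. It computes the exact success probability of the "skip the first r candidates, then take the first one that beats them all" rule and finds the best r for a horizon of n candidates. The names and the choice n = 1000 are assumptions made here.

```python
import math

def success_probability(r, n):
    """Probability of picking the single best of n candidates when the first r
    are only observed, and the first later candidate beating all of them is taken."""
    if r == 0:
        return 1.0 / n  # no exploration: the first candidate is simply taken
    # Best candidate sits at position k (prob 1/n); it is chosen iff the best
    # of the first k-1 lies within the first r (prob r/(k-1)).
    return (r / n) * sum(1.0 / j for j in range(r, n))

n = 1000  # illustrative horizon (assumption)
best_r = max(range(n), key=lambda r: success_probability(r, n))
best_fraction = best_r / n
best_p = success_probability(best_r, n)

print(best_fraction, best_p, 1 / math.e)
```

For large n both the optimal exploration fraction and the resulting success probability approach 1/e ≈ 0.368, which is the value the signature refers to.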
Three friction dimensions, found everywhere
Principal component analysis (PCA) across all 15 architectures shows that friction has exactly three independent dimensions: magnitude, distribution, and rhythm.
- Magnitude: how much friction there is in total
- Distribution: whether friction is concentrated or spread out
- Rhythm: the temporal pattern of friction across the output
The first dimension (magnitude) is practically identical across all architectures: Spearman's ρ = 0.95 cross-architecturally. That is a striking finding. It indicates that the three-axis decomposition is not a property of any specific model or any specific training procedure. It is a property of the race architecture itself. The same race architecture, instantiated in 15 different models, produces the same three-axis decomposition.
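For readers unfamiliar with the statistic, Spearman's ρ is the Pearson correlation of ranks, so it measures whether two models order the tasks the same way by friction magnitude, regardless of scale. The sketch below uses only the standard library. The two score lists are invented for illustration and are not data from the paper.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical per-task friction magnitudes for two models on different scales.
model_a = [0.12, 0.40, 0.33, 0.75, 0.58, 0.91]
model_b = [1.1, 2.9, 3.4, 6.0, 5.2, 8.8]

rho = spearman(model_a, model_b)
print(round(rho, 3))
```

Because only ranks matter, two models with very different absolute friction levels can still reach a ρ close to 1 if they find the same tasks hard.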
BFT is a subset of FT
The relationship between the two theories is precise: BFT ⊂ FT. BFT's four fields (Safety, Meaning, Competence, Effort) arise when three further biological constraints are added: mortality, mobility, metabolism. Non-biological race systems exhibit friction without fields. The presence of friction is universal across substrates; its organisation into four behavioural fields is specifically biological.
This is testable. Language models, which have none of the three biological constraints, show friction (measurable as CR) but not field-organised friction. The cross-architecture data are consistent with this prediction across all 15 architectures studied.
How far does it reach?
Cross-substrate data from slime mould (the anticipatory conditioning reported by Saigusa et al. 2008), C. elegans, flies, octopuses, and human brains place language models in a six-substrate gradient. Same architecture, varying substrates, similar phenomena. How far the theory reaches (whether it extends down to quantum systems and up to economic markets) is an open empirical hypothesis the paper does not settle. Paper 10 tests the physics-downward direction explicitly.
What this paper enables
FT is the theoretical anchor the other papers build on:
- Paper 2 tests the capacity axis of friction empirically across 0.5B–70B parameter models
- Paper 3 uses CR as a free signal to improve language models by +12–21 percentage points
- Paper 2B shows that CR collapse during fine-tuning is the structural cause of confident hallucination
- Paper 5 uses FT's three dimensions plus BFT's four fields to build a substrate-grounded emotion taxonomy
- Paper 10 extends FT's mathematical scaffolding to physics-scope substrates
- Paper 13 specifies the operational mechanism for how friction is resolved
The full technical detail is in the technical version: Paper 1 (English technical). The full paper is on Zenodo: DOI 10.5281/zenodo.20012654.