Vision-Language Models Assimilate Where Humans Contrast: A Cross-Architecture Signature of Contextual Computation

Paper 24 · Pødenphant Lund (2026)

Vision-language models judge an item toward its context (assimilation) where human perception judges it away from its context (contrast). When the per-token competing-routes margin is read as an instrument, classical inferential illusions recur in these models as contextual modulation with a measurable commitment signature, but the modulation runs with a substrate-specific sign. A dose-response sweep replicates the anti-human direction across three architecturally distinct vision encoders on both size and brightness; every slope's 95% confidence interval excludes zero.

DOI (concept)	10.5281/zenodo.20678296
Status	v1 live on Zenodo (2026-06-17)
Venue	TMLR (Transactions on Machine Learning Research)
Author	Tomas Pødenphant Lund

TL;DR

Which "cognitive biases" and "perceptual illusions" are necessary signatures of a bounded decision architecture, and which are substrate accidents? Behavioural data cannot say, because the experimenter must infer a latent decision variable from aggregated choices. The per-token competing-routes (CR) substrate of a language model, where commitment dynamics are observable directly in logprobs, is an instrument that can sort a candidate phenomenon into phantom (an observer-aggregation artefact), architecture-forced (a real bounded-commitment signature), or cruft (a substrate accident).

When this instrument is applied to perception, with vision-language models serving as a no-retina visual substrate, classical inferential illusions recur as contextual modulation with a commitment signature, but with a substrate-specific sign: VLMs modulate the target toward its surround (assimilation) where human perception modulates it away (contrast). The divergence is local and graded. A dose-response sweep shows the judged target tracking the surround monotonically, opposite in sign to the human percept, on the size illusions (Ebbinghaus, Delboeuf) and on brightness (simultaneous-contrast).

The effect replicates across three distinct vision encoders (native-ViT, SigLIP, CLIP) on both dimensions: nine dose-responses, every slope's 95% CI excluding zero. The opposite sign is strong evidence against the simplest account, that VLMs imitate human illusion reports absorbed from text, which would reproduce the human direction. A genuinely architectural cause and a learned-from-training-statistics cause remain the two live hypotheses.

The catalogue problem

The catalogue of "cognitive biases" and "perceptual illusions" conflates three different things: genuine mechanisms of bounded decision-making; phantom effects that are artefacts of how an experimenter aggregates behaviour (Ott, Masset & Gouvêa 2022 show a reported sunk-cost effect in humans, mice and rats is reproduced by a rational agent with no sunk-cost term); and genuine substrate accidents (the retinal blind spot, after-images, Mach bands). Behavioural data alone cannot dissociate the first two, because the experimenter must infer latent cognitive variables from aggregated choices.

A language model changes the epistemics. The per-token competing-routes signal, read directly from logprobs (Pødenphant Lund 2026b), makes the latent commitment dynamics observable. The CR substrate is therefore not merely another system in which to look for biases; it is an instrument that can run a sort behavioural data cannot. This paper states that sorting criterion, refines it onto a measurable axis, and demonstrates it in perception, a modality outside the decision-bias catalogue, using vision-language models as a no-retina visual substrate.

The sorting criterion

For any candidate phenomenon X, three tests:

Recurrence (substrate). Does X appear in two or more mechanism-disjoint optimisation lineages — the biological brain (shaped by evolution) and a gradient-trained network (shaped by SGD)? Recurrence across disjoint lineages is convergent evidence that X is a forced solution rather than a historical accident.
Cost-boundary (regime). Does X's magnitude track the cost/benefit boundary, present where it is near-optimal or cheap to show, vanishing where showing it is costly? Tracking the boundary is the signature of an architecture consequence rather than a fixed quirk.
Rational-null (mechanism, the Ott test). Does a rational agent with no X-mechanism reproduce X's signature via aggregation, attrition or regression-to-the-mean? If yes, X is phantom. If no, X requires a genuine commitment mechanism, visible in the substrate margin (CR/logprobs), and X is real.

The three outcomes: (A) phantom (no mechanism; observer statistics); (B) architecture/RACE (recurs, tracks the cost boundary, requires a real commitment visible in CR); (C) cruft (single-lineage, cost-invariant, substrate-specific). What the CR substrate adds over a behaviour-only test is that the latent commitment margin is directly observable rather than inferred, making the phantom-versus-real dissociation direct and fine-grained.
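The three tests and three outcomes above can be read as a small decision procedure. The following is a minimal sketch under stated assumptions: the `Evidence` record, its field names, the order in which the tests are checked, and the "undetermined" fallback for mixed evidence are all illustrative choices, not part of the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    """Outcomes of the tests for one candidate phenomenon (hypothetical record)."""
    recurs_across_lineages: bool     # Recurrence test
    tracks_cost_boundary: bool       # Cost-boundary test
    rational_null_reproduces: bool   # Rational-null (Ott) test
    commitment_in_margin: bool       # commitment visible in the CR/logprob margin


def sort_phenomenon(e: Evidence) -> str:
    """Map test outcomes to bucket A, B or C; 'undetermined' for mixed evidence."""
    # A: a rational agent with no X-mechanism already reproduces the signature.
    if e.rational_null_reproduces:
        return "A: phantom"
    # B: recurs, tracks the cost boundary, and requires a real commitment.
    if e.recurs_across_lineages and e.tracks_cost_boundary and e.commitment_in_margin:
        return "B: architecture/RACE"
    # C: single-lineage and cost-invariant.
    if not e.recurs_across_lineages and not e.tracks_cost_boundary:
        return "C: cruft"
    # The text does not assign mixed patterns, so the sketch declines to guess.
    return "undetermined"


print(sort_phenomenon(Evidence(True, True, False, True)))
```

Checking the rational-null test first reflects the reading that a phenomenon reproduced by observer statistics needs no further explanation; a different ordering would be an equally defensible assumption.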

Two axes, not one partition

"Cruft" conflates two independent axes, and separating them sharpens the criterion. Axis 1 is the measurable one: does the phenomenon have a decision-CR / commitment signature in the substrate margin? Axis 2 is a lawfulness question argued from physics: is the phenomenon forced (by thermodynamic, developmental or efficient-coding constraints) or a free accident? The two do not co-vary. The blind-spot gap has no decision-CR (axis-1 cruft) yet is physically forced (axis-2 lawful). Bucket C therefore splits into C1, physics-forced non-commitments (the gap, after-images, Mach bands), and C2, near-free accidents, which appears to be nearly empty. This places the falsifiability boundary on the measurable axis, not on the hard-to-establish "free accident."

Method — the substrate instrument

Competition readout. Three quantities are read per token: CR (the count of competing routes above a probability threshold), the top-1−top-2 commitment margin (in nats), and effective routes = exp(Shannon entropy), which serves as the monotone competition measure across the base/instruct format divide.
The gate (informativeness as a blocking pre-flight). A real-condition staircase — can the model judge a genuine difference at all? — is run first over the full counterbalance grid. A cell that fails the gate is inconclusive about the instrument, not evidence that the illusion is cruft.
Prior orthogonalisation. Forced two-alternative choices on a language substrate carry a large first-mentioned-option / position prior. The response is bound to A/B letter tags drawn in the image, counterbalanced over tag-side and mention-order, which moves the order prior onto labels that average out of the spatial signal.
Matched precision. All models are read at the same 4-bit (nf4) precision; contrasts are within-model only.
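The three readout quantities named in the method list can be computed from one token position's log-probabilities. The sketch below is an illustration only. It assumes the margin is the difference between the top two natural-log probabilities (which is what "nats" suggests), that probabilities are renormalised over the candidates actually returned, and that 0.05 is a reasonable route threshold; none of these specifics is taken from the paper, and `competition_readout` is a hypothetical name.

```python
import math


def competition_readout(logprobs, threshold=0.05):
    """Return (cr_count, margin_nats, effective_routes) for one token position.

    logprobs:  natural-log probabilities of the candidate tokens (e.g. a top-k list).
    threshold: probability cut-off for counting a route; 0.05 is an arbitrary
               illustrative value, not the paper's setting.
    """
    if len(logprobs) < 2:
        raise ValueError("need at least two candidate tokens")
    probs = [math.exp(lp) for lp in logprobs]
    total = sum(probs)
    probs = [p / total for p in probs]  # renormalise over the candidates seen (assumption)

    # CR: number of routes whose probability exceeds the threshold.
    cr_count = sum(1 for p in probs if p > threshold)

    # Commitment margin: top-1 minus top-2 log-probability, in nats (assumption).
    ranked = sorted(logprobs, reverse=True)
    margin_nats = ranked[0] - ranked[1]

    # Effective routes: exp of the Shannon entropy (natural log).
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    effective_routes = math.exp(entropy)
    return cr_count, margin_nats, effective_routes


# An evenly split race versus a committed one.
open_race = [math.log(0.25)] * 4
committed = [math.log(0.97), math.log(0.01), math.log(0.01), math.log(0.01)]
print(competition_readout(open_race))
print(competition_readout(committed))
```

Effective routes varies continuously with the distribution, whereas the CR count jumps at the threshold, which is one reason a continuous measure is convenient when comparing models whose output formats differ.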

Results — perception

The mechanism: assimilation, not contrast

A single generator accounts for the divergence. The VLM modulates the target toward the size or intensity of its immediate surround (assimilation) where human perception modulates it away (contrast). It diverges from the human percept wherever the human effect is contrastive: in simultaneous-contrast a light surround makes humans see darker while the VLM commits lighter; in Ebbinghaus, large surrounders make humans see a smaller centre while the VLM sees a larger one; in Delboeuf, a large surrounding ring makes humans see a smaller centre while the VLM sees a larger one. The generator is specifically the assimilation of an enclosed target toward its surround, robust on exactly the enclosed-target illusions. Assimilation versus contrast is a classical dichotomy in human vision (Shapley & Reid 1985); the VLMs tested sit systematically on the assimilation side.

The dose-response across three architectures and two dimensions

For each illusion the context magnitude is swept across five levels, and the slope of the baseline-corrected judgment is read, signed to the swept side. Assimilation predicts that the judged target tracks the context magnitude monotonically, with the opposite sign to the human percept (a positive slope). The result holds across three architecturally distinct families, each pairing a different vision encoder with a different language model: Qwen2-VL-7B (native dynamic-resolution ViT + Qwen2), Idefics2-8B (SigLIP + Mistral), and Phi-3.5-Vision (CLIP ViT-L/14 + Phi-3.5), at n ≈ 24 per cell.

On the size illusions the baseline-corrected judgment rises monotonically with context size (bootstrap 95% CIs over 4000 resamples, all excluding zero): Ebbinghaus +0.22 [+0.14, +0.30] (Qwen) / +0.33 [+0.26, +0.39] (Idefics2) / +0.27 [+0.23, +0.32] (Phi); Delboeuf +0.39 [+0.34, +0.44] (Qwen) / +0.45 [+0.42, +0.48] (Idefics2) / +0.43 [+0.41, +0.45] (Phi). The Delboeuf result is the sharpest test: where the human curve flips positive-to-negative as the ring widens (Yang & Schwaninger 2011; Urale & Schwarzkopf 2023), the VLM runs negative-to-positive, the inverse of the human scale-flip rather than its absence.
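For readers unfamiliar with the statistic, a percentile-bootstrap confidence interval on a dose-response slope can be sketched as follows. This is a generic illustration, not the paper's analysis code: the resampling unit (individual trials), the least-squares slope, and the synthetic data (five levels, 24 trials each, a true slope of 0.3 with Gaussian noise) are all assumptions made for the example, and `bootstrap_slope_ci` is a hypothetical helper.

```python
import random


def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0.0:
        raise ValueError("xs must contain at least two distinct values")
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


def bootstrap_slope_ci(xs, ys, n_resamples=4000, alpha=0.05, seed=0):
    """Return (slope, ci_low, ci_high) using a percentile bootstrap over trials."""
    rng = random.Random(seed)
    n = len(xs)
    slopes = []
    while len(slopes) < n_resamples:
        idx = [rng.randrange(n) for _ in range(n)]  # resample trials with replacement
        bx = [xs[i] for i in idx]
        if len(set(bx)) < 2:
            continue  # skip degenerate resamples with a single context level
        by = [ys[i] for i in idx]
        slopes.append(ols_slope(bx, by))
    slopes.sort()
    ci_low = slopes[int((alpha / 2) * n_resamples)]
    ci_high = slopes[int((1 - alpha / 2) * n_resamples) - 1]
    return ols_slope(xs, ys), ci_low, ci_high


# Synthetic, already baseline-corrected judgments; NOT data from the paper.
demo_rng = random.Random(1)
levels = [lvl for lvl in range(5) for _ in range(24)]
judgments = [0.3 * lvl + demo_rng.gauss(0.0, 0.5) for lvl in levels]
slope, ci_lo, ci_hi = bootstrap_slope_ci(levels, judgments)
print(f"slope={slope:+.2f}  95% CI [{ci_lo:+.2f}, {ci_hi:+.2f}]")
```

An interval whose lower bound sits above zero is what "excluding zero" means for a positive (assimilative) slope; a contrastive, human-direction effect would put the whole interval below zero.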

The same assimilation holds on a second perceptual dimension, brightness. Using the textbook simultaneous-contrast configuration with pure local induction and no global-region confound, all three families commit the target toward its surround's luminance (a light annulus yields lighter, where human simultaneous-contrast yields darker): Qwen2-VL +0.37 [+0.32, +0.42], Idefics2 +0.13 [+0.08, +0.18], Phi-3.5-Vision +0.50 [+0.48, +0.52]. Across both dimensions and all three encoders the picture is uniform: nine dose-responses (3 encoders × 3 illusions), every slope's 95% CI excluding zero. The sign is robust to lower capability within a family (a 2-billion-parameter same-family model replicates the same anti-human sign at smaller magnitude).

A control on scope

A fourth classical figure, the Müller-Lyer illusion, serves as a control on the generator's scope. On the one model that passes the real-condition gate for this figure (Qwen2-VL-7B judges genuine line-length differences at accuracy 1.00), the Müller-Lyer effect is null (+0.004) at adequate n. The assimilation generator is therefore specific to enclosed-target size and brightness configurations; the Müller-Lyer fin geometry, which is not an enclosed-target configuration, does not engage it.

The sort applied to perception

Inferential illusions (Ebbinghaus, simultaneous-contrast, Delboeuf) sort into bucket B (architecture/RACE): they recur across three encoders and produce a decisive commitment signature in the margin, dose-dependent on the context magnitude, but with a substrate-specific sign (assimilative, not contrastive) whose source — a genuine architectural bias versus learned training-data statistics — is left open. The bucket-B assignment rests on the measurable axis (recurrence plus commitment signature); it does not by itself settle the architectural cause. Retinal illusions (after-images, Mach bands, the blind-spot gap) are C1: they cannot even be presented to a no-retina substrate, and the asymmetry is itself the recurrence test.

Discussion

The headline is not "VLMs have human illusions." It is the reverse and sharper: the VLMs' contextual judgments go systematically opposite to the human percept (assimilation, not contrast), producing human-direction illusions only where assimilation and contrast happen to agree. This bears directly on the imitation objection that haunts any "LLMs have biases" claim. A model that had merely absorbed human illusion descriptions from text should reproduce the human direction; one that diverges in a lawful, graded way is not echoing those reports. The substrate-sorting frame is what turns this from a curiosity into a result.

Architecture or learned statistics?

Ruling out imitation-of-reports does not establish that the assimilation is forced by the architecture. A third hypothesis is live: the models may have learned co-occurrence statistics from web-scale image-caption data in which a target's described size or brightness correlates with its surround. Two observations weigh against a purely dataset-driven account without ruling it out. First, the effect replicates across three models trained on different corpora, where a corpus-specific regularity would be expected to vary. Second, the effect is a graded, monotone function of a low-level geometric or photometric parameter on synthetic stimuli unlike any caption-paired natural image, which reads more naturally as an encoding-level computation than as retrieval of a learned caption regularity. Disentangling the two cleanly would require training a VLM from scratch on data with controlled surround-target statistics; the paper flags this as the decisive future experiment.

Why the sign is inverted: subtractive hardware versus additive attention

The sign of the deviation is a candidate fingerprint of which contextual operation the substrate runs. Human contextual contrast is classically a subtractive computation rooted in lateral inhibition: centre-surround receptive fields suppress one another, exaggerating the difference between a target and its surround. A vision-language model has no retina and no lateral inhibition. Its vision encoder mixes patch features by self-attention, whose core operation is a weighted average (an integrative pooling). The hypothesis is that this biases the default contextual operation toward integration, pulling a target patch toward its neighbours, which is assimilation. It is compounded by the training objective: VLMs are optimised to describe and answer about whole scenes, a task that rewards holistic context integration. This is a hypothesis about the net contextual bias these trained encoders exhibit, not a claim that attention is incapable of subtraction. It is testable: if assimilation arises from attention-blending of neighbouring patches, it should be stronger the closer the context is to the target, and a substrate with stronger local or edge-selective processing should assimilate less.
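The contrast between the two operations can be made concrete with a scalar toy. The sketch below is only an illustration of the sign argument, under heavy simplifying assumptions: a single intensity per patch, hand-set attention logits instead of learned query/key projections, no residual stream or MLP, and a linear difference term standing in for lateral inhibition. Both function names and the gain `k` are hypothetical.

```python
import math


def attention_pool(target, surround, logit_target=1.0, logit_surround=0.0):
    """Softmax-weighted average of a target value and its surround values."""
    logits = [logit_target] + [logit_surround] * len(surround)
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(z - peak) for z in logits]
    total = sum(weights)
    values = [target] + list(surround)
    # A convex combination always lies between the target and the surround values.
    return sum(w * v for w, v in zip(weights, values)) / total


def centre_surround(target, surround, k=0.5):
    """Subtractive stand-in for lateral inhibition: exaggerate the difference."""
    mean_surround = sum(surround) / len(surround)
    return target + k * (target - mean_surround)


grey = 0.5
light_surround = [0.9] * 4
print("pooled:     ", attention_pool(grey, light_surround))   # pulled toward the surround
print("subtractive:", centre_surround(grey, light_surround))  # pushed away from it
```

In this toy, raising `logit_surround` increases the pull toward the surround, which mirrors the testable prediction in the text if closer context is assumed to receive more attention weight; whether trained encoders behave this way is exactly what the hypothesis leaves open.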

Connections to other papers in the series

Paper 1 (Friction Theory) — the substrate-universal framework whose competing-routes account argues biases are architectural necessities of any race system; this paper supplies the falsifiable sort that account lacked.
Paper 0 (BFT) — the biological specialisation; the additive-over-subtractive asymmetry connects to the presence-cheap, absence-costly reading of the field architecture.
Paper 13 (Operational Friction Theory) — race-opening and commitment dynamics; the order and position effects discussed here read as race-onset effects on the temporal axis.

Read the paper

The full paper is on Zenodo (concept DOI 10.5281/zenodo.20678296):

Pødenphant Lund, T. (2026). Vision-Language Models Assimilate Where Humans Contrast: A Cross-Architecture Signature of Contextual Computation. Zenodo. https://doi.org/10.5281/zenodo.20678296
