Capacity Scaling of Encoding-Through-Loading: Application vs. Cloze Asymmetry Across Three Orders of Magnitude

Pødenphant Lund, T. (2026c) · Preprint · Live on Zenodo

Two task types built on the same knowledge base scale very differently with model size across three orders of magnitude: cloze retrieval saturates by 8B parameters, while application scales monotonically from 2% at 0.5B to 85% at 70B (Spearman ρ = +1.000 on the Qwen2.5 ladder, n=5; see caveats below). The bottleneck migrates with capacity. “Learning” is not one thing.

DOI: 10.5281/zenodo.20013491
Target venue: PNAS Perspective / Journal of Memory and Language / Trends in Cognitive Sciences
Status: Preprint live; submission package consolidated
Length: ~5,925 words
Author: Tomas Pødenphant Lund

TL;DR

Large language models can be tested on two distinguishable task types that draw on the same underlying knowledge base. Cloze retrieval (recovering a fact as presented) saturates early: most models reach ~90% accuracy by 8B parameters. Application (chaining multiple facts into a derivation) scales monotonically across three orders of magnitude, from 2% at 0.5B to 85% at 70B.

The methodological move that makes this measurable is frontloaded in-context learning (ICL) on a single invented knowledge domain ("Zorbetik") designed to eliminate pretraining-prior confounds. When a model is presented with 47 facts about fictive chemical processes and asked to derive properties of a fictive substance, it cannot draw on prior exposure. What is measured is the substrate's ability to integrate and use just-presented information: encoding-through-loading rather than retrieval.
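The frontloading step described above can be sketched in a few lines. This is a minimal illustration, not the paper's harness: the `Probe` class, the `build_frontloaded_prompt` helper, the instruction wording, and the two placeholder facts are all assumptions invented here, and they are not items from the actual Zorbetik set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Probe:
    kind: str       # "cloze" or "application"; kept for scoring, not shown to the model
    question: str


def build_frontloaded_prompt(facts, probe):
    """Place every fact ahead of the question so the answer must come from context."""
    numbered = "\n".join(f"{i}. {fact}" for i, fact in enumerate(facts, start=1))
    return (
        "Use only the facts below to answer.\n\n"
        f"Facts:\n{numbered}\n\n"
        f"Question: {probe.question}\nAnswer:"
    )


# Placeholder facts, invented for this sketch (hypothetical, not the paper's items).
facts = [
    "Substance A turns blue when heated.",
    "Any substance that turns blue when heated dissolves in liquid B.",
]

# Cloze: recover one fact as presented. Application: chain both facts.
cloze = Probe("cloze", "What colour does substance A turn when heated?")
application = Probe("application", "Does substance A dissolve in liquid B?")

cloze_prompt = build_frontloaded_prompt(facts, cloze)
application_prompt = build_frontloaded_prompt(facts, application)
print(application_prompt)
```

Both probe types share one fact block, so any accuracy gap between them reflects the task type rather than a difference in what was presented.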

Three findings follow:

Finding 1: Application scales monotonically with capacity. Spearman ρ = +1.000 on the Qwen2.5 sub-ladder (n=5, p = 0.0083 one-tailed); cross-family panel ρ = +0.92 (n=9, p = 0.0005); slope +40.8 percentage points per tenfold increase in parameter count. Cloze does not show this pattern.

Finding 2: The bottleneck migrates with capacity. At 0.5B, retrieval fails. At 14B, retrieval is saturated and 36% of questions show a "retrieval succeeds, derivation fails" pattern, the friction-ceiling signature at the encoding level.

Finding 3: Mixture-of-Experts (MoE) models scale on active parameters, not total. A 235B MoE with 22B active parameters behaves on application tasks like a 22B dense model (active-parameter projection within 3 pp of actual; total-parameter projection off by 22 to 33 pp across the two MoE models tested). This has direct implications for benchmark interpretation and architectural prediction.
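The rank statistic in Finding 1 can be checked by hand. The sketch below computes Spearman's ρ with the no-ties formula and an exact one-tailed permutation p-value. The five (size, accuracy) pairs are an assumption: they are the approximate values for the five smallest Qwen2.5 models in the table on this page, and which five models actually form the paper's sub-ladder is not stated here. Any strictly increasing five-point series gives the same result.

```python
from itertools import permutations


def ranks(values):
    """Rank values from 1..n (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman_rho(x, y):
    """Spearman rho via the no-ties formula 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))


def exact_one_tailed_p(x, y):
    """Share of all orderings of y whose rho is at least the observed rho."""
    observed = spearman_rho(x, y)
    perms = list(permutations(y))
    hits = sum(spearman_rho(x, p) >= observed - 1e-12 for p in perms)
    return hits / len(perms)


# Illustrative input (assumed): approximate table values, sizes in billions.
params_b = [0.5, 1.5, 3, 7, 14]
application_pct = [2, 9, 21, 38, 58]

rho = spearman_rho(params_b, application_pct)
p = exact_one_tailed_p(params_b, application_pct)
print(f"rho = {rho:+.3f}, exact one-tailed p = {p:.4f}")
```

With n=5 and a perfectly monotone series, only one of the 5! = 120 orderings reaches ρ = +1, so the exact one-tailed p is 1/120 ≈ 0.0083, which agrees with the value reported in Finding 1. It is also the smallest p-value attainable at n=5, so the statistic cannot distinguish a perfect ladder from a merely very good one.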

The methodological claim: frontloaded ICL operationally substitutes for fine-tuning (FT) in encoding-to-retrieval studies. It is fast (~5 seconds per inference vs. hours for FT), cheap (cents vs. dollars), unified across model families, and produces dense friction data (six statistics per inference). Caveats: ICL is bounded by the context window and lasts only for the prompt it appears in, so FT remains necessary for large knowledge bases, persistence studies, and route-overwrite experiments.

Implications: capacity is single-axis but loads task types differentially. Cloze is indexing-bound; application is composition-bound. The three-dimensional friction framework (magnitude, distribution, rhythm) from Paper 1 decomposes the C-axis into corresponding operational handles.

Cloze vs application accuracy by model size

Model                   Total params   Active params   Cloze accuracy   Application accuracy
Qwen2.5-0.5B            0.5B           0.5B            ~12%             ~2%
Qwen2.5-1.5B            1.5B           1.5B            ~38%             ~9%
Qwen2.5-3B              3B             3B              ~71%             ~21%
Qwen2.5-7B              7B             7B              ~89%             ~38%
Qwen2.5-14B             14B            14B             ~91%             ~58%
Qwen2.5-32B             32B            32B             ~92%             ~73%
Qwen2.5-72B             72B            72B             ~91%             ~85%
Qwen3-30B-A3B (MoE)     30B            3B              ~78%             ~22%
Qwen3-235B-A22B (MoE)   235B           22B             ~92%             ~62%

Rows marked "(MoE)": in these models, application accuracy tracks active parameter count (~22B-equivalent for the 235B MoE) rather than total parameters.
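The active-versus-total comparison can be illustrated with a rough projection from the dense rows. This is a sketch under stated assumptions, not the paper's procedure: it fits an ordinary least-squares line of application accuracy against log10(parameters) on the approximate dense Qwen2.5 values from the table, then evaluates that line at each MoE model's active and total parameter counts. The helper names `fit_log_linear` and `project` are hypothetical, and because the inputs are rounded and the fit set may differ from the paper's, the slope and errors need not match the reported figures.

```python
import math

# Approximate values from the table: (params in billions, application %).
dense = [(0.5, 2), (1.5, 9), (3, 21), (7, 38), (14, 58), (32, 73), (72, 85)]

# (name, total params in B, active params in B, observed application %)
moe = [
    ("Qwen3-30B-A3B", 30, 3, 22),
    ("Qwen3-235B-A22B", 235, 22, 62),
]


def fit_log_linear(points):
    """Least-squares fit of accuracy = intercept + slope * log10(params)."""
    xs = [math.log10(p) for p, _ in points]
    ys = [acc for _, acc in points]
    n = len(points)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx


def project(params_b, slope, intercept):
    """Projected accuracy in percent, clipped to the valid 0..100 range."""
    return min(100.0, max(0.0, intercept + slope * math.log10(params_b)))


slope, intercept = fit_log_linear(dense)
print(f"slope = {slope:.1f} pp per tenfold increase in parameters")

results = []
for name, total, active, observed in moe:
    err_active = abs(project(active, slope, intercept) - observed)
    err_total = abs(project(total, slope, intercept) - observed)
    results.append((name, err_active, err_total))
    print(f"{name}: active-param error {err_active:.1f} pp, "
          f"total-param error {err_total:.1f} pp")
```

On these table values the active-parameter projection lands much closer to the observed accuracy than the total-parameter projection for both MoE models, which is the qualitative pattern the MoE rows show.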

Cite

Pødenphant Lund, T. (2026c). Capacity Scaling of Encoding-Through-Loading: Application vs. Cloze Asymmetry Across Three Orders of Magnitude [Preprint]. Zenodo. https://doi.org/10.5281/zenodo.20013491