Capacity Scaling of Encoding-Through-Loading: Application vs. Cloze Asymmetry Across Three Orders of Magnitude
Pødenphant Lund, T. (2026c) · Preprint · Live on Zenodo
Two task types that draw on the same knowledge base scale very differently across three orders of magnitude of model size: cloze retrieval saturates by 8B parameters, while application scales monotonically from 2% at 0.5B to 85% at 70B (Spearman ρ = +1.000 on the Qwen2.5 ladder, n=5; see caveats below). The bottleneck migrates with capacity. “Learning” is not one thing.
| Field | Value |
|---|---|
| DOI | 10.5281/zenodo.20013491 |
| Target venue | PNAS Perspective / Journal of Memory and Language / Trends in Cognitive Sciences |
| Status | Preprint live; submission package consolidated |
| Length | ~5,925 words |
| Author | Tomas Pødenphant Lund [ORCID] |
TL;DR
Large language models solve two distinguishable task types on the same underlying knowledge base. Cloze retrieval (recovering a fact as presented) saturates early: most models reach ~90% accuracy by 8B parameters. Application (chaining multiple facts into a derivation) scales monotonically across three orders of magnitude, from 2% at 0.5B to 85% at 70B.
The methodological move that makes this measurable is frontloaded in-context learning (ICL) on a single invented knowledge domain ("Zorbetik") designed to eliminate pretraining-prior confounds. When a model is presented with 47 facts about fictive chemical processes and asked to derive properties of a fictive substance, it cannot draw on prior exposure. What is measured is the substrate's ability to integrate and use just-presented information: encoding-through-loading rather than retrieval.
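A minimal sketch of how a frontloaded prompt of this kind might be assembled, assuming the facts are plain strings placed ahead of the question. The function name `build_frontloaded_prompt`, the prompt wording, and the placeholder facts are hypothetical illustrations, not the paper's actual materials.

```python
def build_frontloaded_prompt(facts, question):
    """Put every fact before the question so the answer must come from the prompt itself."""
    if not facts:
        raise ValueError("at least one fact is required")
    numbered = [f"{i}. {fact}" for i, fact in enumerate(facts, start=1)]
    header = "Facts about the invented domain (assume nothing beyond these):"
    return "\n".join([header, *numbered, "", f"Question: {question}", "Answer:"])


# Placeholder strings: hypothetical stand-ins, not the paper's 47 Zorbetik facts.
facts = [f"<invented fact {i}>" for i in range(1, 48)]

# Same knowledge base, two task types: only the question changes.
cloze_prompt = build_frontloaded_prompt(facts, "<restate one fact with a blank>")
application_prompt = build_frontloaded_prompt(facts, "<question requiring several facts>")
```

Holding the fact block fixed and varying only the question is what lets the two task types be compared on identical knowledge.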
Three findings follow:
Finding 1 — Application scales monotonically with capacity: Spearman ρ = +1.000 on the Qwen2.5 sub-ladder (n=5, p = 0.0083 one-tailed); cross-family panel ρ = +0.92 (n=9, p = 0.0005); slope +40.8 percentage points per decade. Cloze does not show this pattern.
Finding 2 — The bottleneck migrates with capacity: at 0.5B, retrieval fails. At 14B, retrieval is saturated and 36% of questions show a "retrieval succeeds, derivation fails" pattern: the friction-ceiling signature at the encoding level.
Finding 3 — Mixture-of-Experts (MoE) models scale on active parameters, not total: a 235B MoE with 22B active parameters behaves on application tasks like a 22B dense model (active-parameter projection within 3pp of actual; total-parameter projection off by 22-33pp across two MoE models tested). This has direct implications for benchmark interpretation and architectural prediction.
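The one-tailed p-value quoted in Finding 1 can be checked by enumeration: with n = 5 and no ties, exactly one of the 5! = 120 possible rank orderings is perfectly monotone, which gives 1/120 ≈ 0.0083. The sketch below assumes an exact permutation test was used; the summary does not name the test, and the function names are illustrative.

```python
from fractions import Fraction
from itertools import permutations


def spearman_rho(ranks_x, ranks_y):
    """Spearman's rho for two tie-free rank vectors of equal length."""
    n = len(ranks_x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks_x, ranks_y))
    return 1 - Fraction(6 * d2, n * (n * n - 1))


def exact_one_tailed_p(n, observed_rho):
    """Share of all n! rank orderings whose rho is at least the observed value."""
    base = tuple(range(1, n + 1))
    hits = total = 0
    for perm in permutations(base):
        total += 1
        if spearman_rho(base, perm) >= observed_rho:
            hits += 1
    return Fraction(hits, total)


p_value = exact_one_tailed_p(5, 1)
print(p_value, round(float(p_value), 4))  # 1/120 0.0083
```

This is also the smallest one-tailed value an exact test can return at n = 5.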
The methodological claim: frontloaded ICL operationally substitutes for fine-tuning (FT) in encoding-to-retrieval studies. It is fast (~5 seconds per inference vs. hours for FT), cheap (cents vs. dollars), applies uniformly across model families, and produces dense friction data (six statistics per inference). Caveats: ICL is bounded by the context window and does not persist beyond a single prompt, so FT remains necessary for large knowledge bases, persistence studies, and route-overwrite experiments.
Implications: capacity is single-axis but loads task types differentially. Cloze is indexing-bound; application is composition-bound. The three-dimensional friction framework (magnitude, distribution, rhythm) from Paper 1 decomposes the C-axis into corresponding operational handles.
Cloze vs application accuracy by model size
| Model | Total params | Active params | Cloze accuracy | Application accuracy |
|---|---|---|---|---|
| Qwen2.5-0.5B | 0.5B | 0.5B | ~12% | ~2% |
| Qwen2.5-1.5B | 1.5B | 1.5B | ~38% | ~9% |
| Qwen2.5-3B | 3B | 3B | ~71% | ~21% |
| Qwen2.5-7B | 7B | 7B | ~89% | ~38% |
| Qwen2.5-14B | 14B | 14B | ~91% | ~58% |
| Qwen2.5-32B | 32B | 32B | ~92% | ~73% |
| Qwen2.5-72B | 72B | 72B | ~91% | ~85% |
| Qwen3-30B-A3B (MoE) | 30B | 3B | ~78% | ~22% |
| Qwen3-235B-A22B (MoE) | 235B | 22B | ~92% | ~62% |
Rows marked "(MoE)": in these models, application accuracy tracks active parameter count (~22B-equivalent for the 235B MoE) rather than total parameters.
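The active-versus-total comparison can be illustrated with the table values. The sketch below fits a log-linear trend to the dense Qwen2.5 rows and projects each MoE model twice, once at its active and once at its total parameter count. The ordinary-least-squares fit and the clipping to 0..100 are assumptions made for illustration, since the summary does not describe the paper's projection procedure. The table values are approximate, so the printed errors need not match the figures quoted in Finding 3.

```python
import math

# Dense Qwen2.5 rows from the table: (parameters in billions, application accuracy in %).
# Values are the approximate table entries, so results are illustrative only.
DENSE = [(0.5, 2), (1.5, 9), (3, 21), (7, 38), (14, 58), (32, 73), (72, 85)]
# MoE rows: name -> (total params in B, active params in B, application accuracy in %).
MOE = {
    "Qwen3-30B-A3B": (30, 3, 22),
    "Qwen3-235B-A22B": (235, 22, 62),
}


def fit_log_linear(points):
    """Least-squares fit of accuracy on log10(parameters); returns (slope, intercept)."""
    xs = [math.log10(p) for p, _ in points]
    ys = [acc for _, acc in points]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def project(params_b, slope, intercept):
    """Projected accuracy for a dense model of this size, clipped to 0..100 (an assumption)."""
    return max(0.0, min(100.0, slope * math.log10(params_b) + intercept))


slope, intercept = fit_log_linear(DENSE)
errors = {}
for name, (total_b, active_b, actual) in MOE.items():
    err_active = abs(project(active_b, slope, intercept) - actual)
    err_total = abs(project(total_b, slope, intercept) - actual)
    errors[name] = (err_active, err_total)
    print(f"{name}: active-projection error {err_active:.1f} pp, "
          f"total-projection error {err_total:.1f} pp")
```

On these approximate values the active-parameter projection lands much closer to the observed accuracy than the total-parameter projection for both MoE rows, which is the qualitative pattern the finding describes.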
Companion papers
- Paper 1 (Friction Theory) — introduced the C-dimension prediction tested here
- Paper 0 (BFT) — biological instantiation; field-organised friction
- Paper 3 (Friction-Guided Inference) — uses commitment-gap signature for inference-time correction