Capacity Scaling of Encoding-Through-Loading

Paper 2 · Pødenphant Lund (2026c) · Read on Zenodo

I study language models to understand people. The smallest language model I tested (0.5 billion parameters) chains two new facts together into a new answer 2% of the time. The largest (70 billion) does it 85% of the time. By contrast, on the same facts the 70-billion model is about as good as a 7-billion one at simply retrieving a single fact. So “learning” is not one thing. It is at least two, and the two scale with model size in completely different ways.

What it is about

This paper takes one very specific puzzle and uses it to open a much bigger one. The puzzle: when you give a language model new information, it answers some kinds of questions about that information well and other kinds badly. Which kinds go well and which go badly follows a very regular pattern. And what that pattern tells us is that "learning" is not one thing. It is at least two things, and they scale with capacity in completely different ways.

Two task types, same knowledge

Take 47 facts about an invented domain. In the paper we use an invented subject called "Zorbetik" so the model cannot have seen any of it during training. Now ask two kinds of questions about those facts:

Cloze (retrieval): "What is the catalyst for the reaction described in fact 23?" The model just has to find and reproduce a fact. This saturates early: most models reach 90% accuracy at 8 billion parameters. Below 8B, performance climbs steeply; above 8B, it flattens out.
Application (chaining facts): "If you took the substance from fact 12 and exposed it to the conditions from fact 31, then re-ran the catalysis from fact 23 on the result, what would you get?" The model has to retrieve several facts and chain them into a new result. This scales monotonically: from 2% at 0.5 billion parameters to 85% at 70 billion. Spearman correlation across the Qwen2.5 ladder: ρ = +1.000, meaning accuracy rises at every step up in size (a perfect rank ordering, not necessarily a straight line).
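To make the ρ = +1.000 claim concrete, here is a minimal sketch of Spearman's rank correlation in plain Python. It uses only the three application accuracies quoted in this summary (0.5B, 8B, 70B); the paper's full ladder has more points, which are not reproduced here. The function names `ranks` and `spearman` are my own for illustration.

```python
def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over any run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Only the values quoted in this summary, not the paper's full data set.
params_b = [0.5, 8, 70]          # model size in billions of parameters
application_pct = [2, 40, 85]    # application accuracy in percent
rho = spearman(params_b, application_pct)
print(rho)
```

Because ρ depends only on rank order, any ladder where accuracy rises at every step gives exactly +1, however uneven the steps are.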

Same knowledge, different load. Cloze retrieval saturates at ~8B parameters; application scales monotonically from 0.5B to 70B, more than two orders of magnitude. The gap between them at any model size is the friction-ceiling pattern at the encoding level.

Same knowledge. Different load. Cloze is indexing-bound: if you have the fact stored, can you retrieve it? (Like looking up a phone number you already know.) Application is composition-bound: can you hold several facts in mind at once and combine them under load? (Like using several phone numbers to solve a riddle.)

That bigger models do better is itself old news: the neural scaling-laws work (Kaplan and the Chinchilla studies) already showed capability rising with size. What this paper adds is a clean split on the same knowledge base, where one task type saturates early and the other keeps climbing, and a link between that split and how the knowledge is encoded.

What the test looks like. Three example facts from the invented Zorbetik knowledge base:

The currency of Zorbetik is the vren.
One vren is divided into 47 plinks.
Zorbetik's national anthem is performed only on Wednesdays.

Cloze question (indexing-bound): "What is the currency of Zorbetik?" → answer: "vren". One fact, one lookup.
Application question (composition-bound): "If a tourist arrives in Zorbetik on a Wednesday with 235 plinks, can he afford a souvenir costing 4 vren after hearing the national anthem?" → requires holding three facts in mind at once and combining them.
Same knowledge base. Different cognitive load. Models scale very differently on the two.
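The difference between the two question types can be sketched in a few lines of Python. This is an illustration of the load difference, not the paper's evaluation code: the dictionary keys and function names are invented here, and I assume that hearing the anthem costs nothing and happens on the arrival day.

```python
# The three example facts as a tiny knowledge base.
# Key names are illustrative; in the experiments the facts are prose in the prompt.
facts = {
    "currency": "vren",
    "plinks_per_vren": 47,
    "anthem_day": "Wednesday",
}


def cloze_currency(kb):
    """Indexing-bound: one fact, one lookup."""
    return kb["currency"]


def can_afford_after_anthem(kb, arrival_day, plinks, price_vren):
    """Composition-bound: all three facts are needed to reach the answer."""
    # Fact 3: the anthem can only be heard on the day it is performed.
    anthem_heard = arrival_day == kb["anthem_day"]
    # Fact 2: convert plinks into whole vren.
    whole_vren = plinks // kb["plinks_per_vren"]
    # Fact 1: the price is quoted in vren, so the two amounts are comparable.
    return anthem_heard and whole_vren >= price_vren


print(cloze_currency(facts))                                 # vren
print(can_afford_after_anthem(facts, "Wednesday", 235, 4))   # True: 235 plinks = 5 vren
```

The cloze function touches one entry; the application function has to hold all three and combine them, which is the step the smaller models fail at.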

The bottleneck moves

The most interesting finding is not either curve on its own. It is what they say together: the bottleneck shifts as capacity grows.

At 0.5B parameters the model cannot even retrieve. Both cloze and application fail. Friction is everywhere: the model is overloaded by even basic indexing. The substrate does not have enough working room.

At 8B, retrieval is largely solved. Cloze is at 90%. But application still struggles, only ~40%. The model knows the facts but cannot chain them. We can see it directly: at 14B parameters, about 36% of the errors show a "retrieval succeeds, derivation fails" pattern. The substrate has the information but does not have working room to compose with it under load. This is the friction-ceiling pattern at the encoding level.
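One way to arrive at a figure like "36% of the errors are retrieval succeeds, derivation fails" is to label each failed application item by where it broke. The sketch below assumes each item records whether the needed facts were retrieved and whether the final answer was right; the field names `facts_retrieved` and `answer_correct` are hypothetical, and the toy records are made up to exercise the code, not taken from the paper.

```python
from collections import Counter


def classify(item):
    """Label one application item by where it failed (hypothetical fields)."""
    if item["answer_correct"]:
        return "correct"
    if item["facts_retrieved"]:
        return "retrieval_ok_derivation_failed"
    return "retrieval_failed"


def error_shares(items):
    """Share of each failure type among the incorrect items only."""
    labels = [classify(item) for item in items]
    errors = Counter(label for label in labels if label != "correct")
    total = sum(errors.values())
    return {label: count / total for label, count in errors.items()} if total else {}


# Toy records for illustration only.
toy_items = [
    {"facts_retrieved": True, "answer_correct": True},
    {"facts_retrieved": True, "answer_correct": False},
    {"facts_retrieved": False, "answer_correct": False},
    {"facts_retrieved": False, "answer_correct": False},
]
shares = error_shares(toy_items)
print(shares)
```

Correct items are excluded from the denominator, so the shares describe the error mix rather than overall accuracy.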

At 70B both are largely solved. Cloze 90%, application 85%. The substrate now has enough capacity to retrieve and compose. The bottleneck has lifted past where the task lives.

What this maps onto is a familiar pattern in human learning: a beginner can recite the formula but cannot apply it; an intermediate-level student can apply it in simple cases but breaks down under load; an expert applies it fluently. Same knowledge, three different performance levels, driven by where the substrate's bottleneck sits relative to the task's demands.

Why the invented domain matters

A common criticism of LLM experiments: "the model already knew the answer from its training data; you are just measuring memorisation." The Zorbetik design defeats this completely. Every fact in the domain is fictional. The names of the substances, the catalysts, the reaction conditions: all invented. The model cannot have seen any of it during training.

What this lets us measure is the substrate's raw ability to integrate and use information presented in the prompt: take this new knowledge, hold it, combine it, derive new conclusions from it. The performance numbers we report are clean, not contaminated by what the model already knew.

MoE models scale on active parameters, not total

A side finding with deployment implications: a 235-billion-parameter Mixture-of-Experts model with 22 billion active parameters performs on application tasks like a 22B dense model, not a 235B. The active-parameter projection lands within 3 percentage points of the actual performance; the total-parameter projection is 22–33 percentage points off.
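What "projecting" an MoE model onto the dense curve means can be sketched as follows. This is not the paper's projection method, which this summary does not describe: it is a rough piecewise-linear interpolation in log parameter count, anchored only on the three dense application accuracies quoted above, so the numbers it produces should not be read as the paper's projections. The name `project_application` and the clamping to 0–100% are my own assumptions.

```python
import math

# Dense anchor points quoted in this summary:
# (parameters in billions, application accuracy in percent).
DENSE_POINTS = [(0.5, 2.0), (8.0, 40.0), (70.0, 85.0)]


def project_application(params_b, points=DENSE_POINTS):
    """Piecewise-linear interpolation in log10(parameters), clamped to [0, 100].

    Outside the anchored range, the nearest segment's slope is extended.
    """
    x = math.log10(params_b)
    pts = [(math.log10(p), acc) for p, acc in points]
    # Stop at the first segment whose right end covers x; if none does,
    # the loop ends on the last segment, which is then used to extrapolate.
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if x <= x1:
            break
    slope = (y1 - y0) / (x1 - x0)
    return max(0.0, min(100.0, y0 + slope * (x - x0)))


active_proj = project_application(22)    # projection from active parameters
total_proj = project_application(235)    # projection from total parameters
print(active_proj, total_proj)
```

The comparison in the text then amounts to checking which of the two projections lands closer to the MoE model's measured application accuracy.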

This matters because MoE models are routinely benchmarked at their total parameter count. If you care about composition-bound tasks (reasoning, multi-step problem-solving, anything that requires holding several things in working memory), it is the active parameter count that matters.

A practical shortcut

The experiments use a technique called frontloaded in-context learning: instead of fine-tuning a model on the new knowledge (which takes hours per experiment, costs money, and requires GPU access), we just put all the knowledge in the prompt and ask the question.

It is fast (~5 seconds vs hours), cheap (cents vs dollars), and uniform across model families (no model-specific fine-tuning recipes). It also lets us measure friction directly via per-token logprobs while the model produces the answer.
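A minimal sketch of the frontloading step and of a logprob-based score, under stated assumptions: the prompt layout is my own, and I use mean negative log-probability per answer token as a stand-in friction score, since the paper's exact metric is not given in this summary. No model API is called; `token_logprobs` is whatever list of per-token log-probabilities your inference stack returns.

```python
def build_frontloaded_prompt(facts, question):
    """Put every fact in the prompt ahead of the question (no fine-tuning)."""
    numbered = "\n".join(f"{i}. {fact}" for i, fact in enumerate(facts, start=1))
    return f"Facts:\n{numbered}\n\nQuestion: {question}\nAnswer:"


def mean_surprisal(token_logprobs):
    """Average negative log-probability per answer token, in nats.

    Assumed proxy for friction: higher means the model was less certain
    while producing the answer.
    """
    if not token_logprobs:
        raise ValueError("need at least one token logprob")
    return -sum(token_logprobs) / len(token_logprobs)


zorbetik_facts = [
    "The currency of Zorbetik is the vren.",
    "One vren is divided into 47 plinks.",
    "Zorbetik's national anthem is performed only on Wednesdays.",
]
prompt = build_frontloaded_prompt(zorbetik_facts, "What is the currency of Zorbetik?")
print(prompt)

# Hypothetical logprobs for a two-token answer, for illustration only.
print(mean_surprisal([-0.1, -0.3]))
```

Because the whole knowledge base travels with every prompt, the same two functions work unchanged across model families, which is the uniformity the text refers to.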

A note on credit: I came to this approach myself, out of frustration at how slowly fine-tuning ran. I later found out that others had used variants of frontloaded-context substitution before me, so I do not claim the technique is mine, only that it was the one practical way through this experiment series.

Caveats: ICL is limited by the context window and is ephemeral, so you have to supply the knowledge afresh in each prompt. For very large knowledge sets, for persistence studies, and for route-overwrite experiments, fine-tuning is still necessary. (Paper 2B shows that it is not only a cost trade-off: ICL and FT instantiate fundamentally different memory regimes.)

Implications

For Friction Theory: capacity is one axis of friction, and this paper isolates it. The bottleneck migration with capacity gives us a clean window into how friction works at the encoding level.

For education science: the same knowledge encoded at different capacity levels supports different task types. A student who can do cloze cannot necessarily do application; a student who can do application cannot necessarily do far transfer. The gap between cloze success and application success is not motivation, and it is not a "knowledge gap" in the normal sense: the student has the knowledge. The gap is composition-bound computation under load, and that calls for a different intervention.

For AI deployment: the active-parameter scaling result has direct implications for MoE benchmark interpretation. A 235B MoE benchmarked on retrieval looks 90%-accurate; on application it looks 22B-shaped. Pick the right benchmark for the right deployment.

What I do not know

Each scaling curve is measured on one model family at a time (the Qwen2.5 ladder, with a few checks on others). That the same capacity threshold sits in the same place across all architectures is a reasonable expectation, not something I have shown. Another family could have the bottleneck migration somewhere else.

And it is all measured with knowledge held in the prompt, not trained into the weights. Paper 2B shows that the two are not interchangeable, so the application threshold may sit differently when knowledge is fine-tuned in. That the same capacity gradient applies to biological learning is a conjecture, not a measurement.