Assay
SCC evaluates exact counting of repeated items under randomized sequence lengths, removing semantic landmarks, factual knowledge, and ambiguous scoring.
Research article
Large language models perform strongly on benchmarks in reasoning, coding, and long-context understanding, but these evaluations do not directly establish whether models can preserve procedural state over many steps. We introduce Stable Counting Capacity (SCC), a mechanical assay in which a model receives a homogeneous repeated-item sequence and is asked to return the exact count as a single integer. The measured counting capacity (CC) identifies the boundary at which exact rule execution ceases to be reliable.
Counting capacity is the largest sequence-length regime in which the model preserves the exact tally within a fixed reliability criterion.
Across evaluated model variants, stable counting capacity remains far below advertised context limits, and exact counting fails abruptly rather than degrading gradually.
The results indicate bounded procedural state maintenance rather than open-ended rule execution or simple length estimation.
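To make the definition of counting capacity concrete, the following is a minimal sketch of how a capacity value could be read off a table of exact-match outcomes. It is not the paper's adaptive randomized ladder. The function name `estimate_counting_capacity`, the `trials` layout, and the 0.9 threshold are all illustrative assumptions; the source states only that a fixed reliability criterion is used.

```python
def estimate_counting_capacity(trials, threshold=0.9):
    """Return the largest tested length L such that every tested length <= L
    reaches the accuracy threshold; return 0 if the shortest length fails.

    trials: dict mapping sequence length -> list of bools, where True means
            the model returned the exact count on that trial.
    threshold: hypothetical reliability criterion (assumed, not from the paper).
    """
    capacity = 0
    for length in sorted(trials):
        outcomes = trials[length]
        if not outcomes:
            continue  # lengths with no recorded trials are skipped
        accuracy = sum(outcomes) / len(outcomes)
        if accuracy < threshold:
            break  # first unreliable length ends the stable regime
        capacity = length
    return capacity


# Made-up outcomes for illustration only; these are not measured results.
example_trials = {
    10: [True] * 10,
    20: [True] * 9 + [False],
    40: [True] * 5 + [False] * 5,
    80: [True] * 10,
}
example_capacity = estimate_counting_capacity(example_trials)
```

Stopping at the first failing length means a later recovery (length 80 in the made-up data) does not extend the estimate, which matches the idea of a contiguous regime of reliable counting.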
The assay asks a minimal diagnostic question: when semantics and external knowledge are removed, how far can a model carry a rule-defined state through context?
a, a, a, a, a, a, a, a, a, a, ...
Return only the exact total count as a single integer.
The experiments separate long-context access from reliable procedural execution. Models can process long prompts while failing to preserve a simple count-defined variable through those prompts.
Behavioral and activation-level analyses suggest that successful counting is supported by finite, syntax-sensitive internal trajectories. While these trajectories remain organized, the model can decode the exact count. Once they collapse or cease to be decoder-usable, the model defaults to plausible numerical outputs rather than preserving the rule-defined state.
This generator constructs simplified SCC-style counting prompts. It is intended for inspection and informal testing only; the paper estimates CC using an adaptive randomized ladder with repeated trials and randomized target lengths.
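A simplified prompt generator of this kind might look like the sketch below. It reproduces the prompt format shown above (a comma-separated run of one repeated item followed by the instruction line) and scores a reply by exact match. The function names, the default item and separator, and the uniform length sampling are assumptions made for illustration; none of this is the paper's adaptive estimation procedure.

```python
import random
import re

INSTRUCTION = "Return only the exact total count as a single integer."


def make_counting_prompt(n, item="a", sep=", "):
    """Build a homogeneous repeated-item prompt containing exactly n items."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    sequence = sep.join([item] * n)
    return f"{sequence}\n{INSTRUCTION}"


def sample_prompt(min_len, max_len, rng=None):
    """Draw a target length uniformly from [min_len, max_len] (an assumed
    sampling scheme) and return (prompt, true_count)."""
    rng = rng or random.Random()
    n = rng.randint(min_len, max_len)
    return make_counting_prompt(n), n


def is_exact(response, true_count):
    """Score a reply: correct only if, after trimming whitespace, it is a
    single run of ASCII digits equal to the true count."""
    text = response.strip()
    return re.fullmatch(r"[0-9]+", text) is not None and int(text) == true_count
```

For informal testing, `sample_prompt(50, 500, random.Random(0))` returns a reproducible prompt and its ground-truth count, and `is_exact(reply, count)` rejects near misses and any reply containing extra text.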