Research article

Counting as a Minimal Probe of Language Model Reliability

Large language models perform strongly on benchmarks in reasoning, coding, and long-context understanding, but these evaluations do not directly establish whether models can preserve procedural state over many steps. We introduce Stable Counting Capacity (SCC), a mechanical assay in which a model receives a homogeneous repeated-item sequence and is asked to return the exact count as a single integer. The measured counting capacity (CC) identifies the boundary at which exact rule execution ceases to be reliable.

Authors
Tianxiang Dai and Jonathan A. Fan
Year
2026
Materials
Paper, code, benchmark data, and prompt generator

Assay

SCC evaluates exact counting of repeated items under randomized sequence lengths; it removes semantic landmarks and factual knowledge, and every trial has a single exact ground-truth answer, eliminating scoring ambiguity.

Measurement

Counting capacity is the largest sequence-length regime in which the model preserves the exact tally within a fixed reliability criterion.
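Read operationally, this definition admits a simple estimator once per-length success rates are in hand. The sketch below is a minimal Python reading under an assumed fixed accuracy threshold; the paper's precise reliability criterion and trial counts are not reproduced here.

# Minimal sketch: CC as the largest probed length at which this and every
# shorter probed length meet a fixed success-rate threshold. The threshold
# value is an illustrative assumption, not the paper's exact criterion.
def counting_capacity(success_rate: dict[int, float], threshold: float = 0.9) -> int:
    capacity = 0
    for length in sorted(success_rate):       # ascending probed lengths
        if success_rate[length] < threshold:  # first unreliable rung ends the regime
            break
        capacity = length
    return capacity

# Example: exact counting holds through 256 items, then collapses abruptly.
rates = {32: 1.0, 64: 1.0, 128: 0.95, 256: 0.92, 512: 0.40, 1024: 0.05}
print(counting_capacity(rates))  # -> 256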

Result

Across evaluated model variants, stable counting capacity remains far below advertised context limits and fails abruptly rather than gradually.

Interpretation

The results indicate bounded procedural state maintenance rather than open-ended rule execution or simple length estimation.

Stable Counting Capacity assay

The assay asks a minimal diagnostic question: when semantics and external knowledge are removed, how far can a model carry a rule-defined state through context?

a, a, a, a, a, a, a, a, a, a, ...

Return only the exact total count as a single integer.
  • Synthetic input with exact ground truth
  • Repeated items without semantic landmarks
  • Adaptive randomized ladder for estimating stable capacity (see the sketch below)
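An expand-then-bisect ladder gives one concrete shape for the capacity estimate. The sketch below is illustrative rather than the paper's protocol: ask_exact_count is a hypothetical callable that builds a counting prompt of roughly the requested length, queries the model, and returns True only on an exact answer, while the trial count, threshold, and jitter are assumed parameters.

import random

def estimate_capacity(ask_exact_count, max_length=200_000,
                      trials=5, threshold=0.8, jitter=0.1):
    # Probe a rung with several randomized lengths so the model cannot
    # pattern-match one fixed sequence length.
    def reliable(length):
        probes = [max(1, round(length * random.uniform(1 - jitter, 1 + jitter)))
                  for _ in range(trials)]
        return sum(ask_exact_count(n) for n in probes) / trials >= threshold

    # Phase 1: double the length until exact counting becomes unreliable.
    lo, hi = 0, 1
    while hi <= max_length and reliable(hi):
        lo, hi = hi, hi * 2
    if hi > max_length:
        return lo  # reliable up to the cap; no boundary found

    # Phase 2: bisect between the last reliable and first unreliable rungs.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        lo, hi = (mid, hi) if reliable(mid) else (lo, mid)
    return lo  # longest length meeting the reliability criterion (0 if none)

The expand-then-bisect structure keeps the number of model queries logarithmic in the measured capacity, which matters when each rung costs several long-prompt trials.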

Empirical observations

The experiments separate long-context access from reliable procedural execution: models can ingest long prompts while failing to preserve a single rule-defined count across them.

[Figure: stable-count number line plotting measured counting capacity against nominal context length.]
Measured CC values are frequently far below nominal context windows, indicating that context length alone does not imply reliable state maintenance.

[Figure: token efficiency, dual-task counting error, and reasoning comparison plots.]
Increased token expenditure and reasoning-style generation do not reliably recover exact counts, while concurrent reasoning and coding tasks interfere with counting accuracy.

[Figure: boundary behavior, normalized predictions, attractor values, and character sensitivity in SCC evaluations.]
Near the failure boundary, predictions shift from exact tracking to large errors and repeated attractor values, with performance sensitive to surface syntax.

Mechanistic interpretation

Behavioral and activation-level analyses suggest that successful counting is supported by finite, syntax-sensitive internal trajectories. While these trajectories remain organized, the exact count can be decoded from them. Once they collapse or can no longer be read out, the model defaults to plausible numerical outputs rather than preserving the rule-defined state.

Prompt generator

This generator constructs simplified SCC-style counting prompts. It is intended for inspection and informal testing only; the paper estimates CC using an adaptive randomized ladder with repeated trials and randomized target lengths.
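A minimal Python analogue is sketched below. The sequence format and instruction wording follow the example shown above; the item token, separator, and randomized-length range are illustrative choices rather than the paper's exact implementation.

import random

def make_scc_prompt(n_items=None, low=16, high=512, item="a"):
    # Build a simplified SCC-style counting prompt with exact ground truth.
    # If n_items is None, draw a length uniformly at random, mirroring the
    # randomized sequence lengths used in the assay.
    if n_items is None:
        n_items = random.randint(low, high)
    sequence = ", ".join([item] * n_items)
    prompt = sequence + "\n\nReturn only the exact total count as a single integer."
    return prompt, n_items

prompt, answer = make_scc_prompt(128)  # the interactive default of 128 items
assert answer == 128

Because the returned answer is the exact ground truth, scoring reduces to an integer equality check against the model's reply.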
