You Control Two of Three Context Degradation Factors
I built an agent system that runs 50+ turn sessions. A task queue fires overnight, each task spawning its own multi-turn conversation. Context degrades. Everyone who has run agents at this scale knows the 30-50K token wall — the point where instructions start getting missed, YAML gets malformed, reasoning gets sloppy.
But I wanted to know something more specific: does the quality of what’s in that context matter, or is it purely a function of how much? I dug into the literature. The answer changes how you engineer context.
The Three-Factor Decomposition
I want to be upfront: the decomposition I’m about to present is my synthesis across multiple papers, not a framework from any single study. Du et al. (2025) cleanly separate the first two factors. The third comes from different experimental settings entirely. Nobody has tested all three together. That said, the decomposition is useful because it tells you exactly where to spend your engineering effort.
Factor 1: Length Tax (13.9-50% performance drop). This is the architectural floor. Even with perfect retrieval and zero distractors, performance drops as context gets longer. Du et al. (2025, “Context Length Alone Hurts LLM Performance Despite Perfect Retrieval”) ran the cleanest experiment I’ve seen on this. They tested three conditions: essays as distractors, whitespace padding, and masked (meaningless) tokens. The masked condition is the key — no semantic content at all, just positional filler. Performance still dropped 13.9-50% depending on the task.
This is positional encoding drift and attention dilution at work. The softmax denominator grows, each token’s attention share shrinks, and position encodings move further from their training distribution. You cannot fix this. It’s baked into the transformer architecture. Wang et al. (2026, “Intelligence Degradation in Long-Context LLMs”) found that catastrophic degradation kicks in at roughly 40-50% of a model’s maximum context length — a sharper cliff than the gradual decline most people expect.
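The dilution half of that argument can be illustrated with a toy calculation. The sketch below is a deliberate simplification and not a model of any real transformer: it assumes a single attention head, one relevant token with a fixed logit advantage (`relevant_logit`, a hypothetical value), and every filler token sitting at logit 0. It only shows how a growing softmax denominator shrinks the relevant token's share.

```python
import math


def relevant_attention_share(n_filler: int, relevant_logit: float = 4.0) -> float:
    """Softmax weight on one relevant token among n_filler tokens at logit 0."""
    relevant = math.exp(relevant_logit)
    # Each filler token contributes exp(0) = 1 to the softmax denominator.
    return relevant / (relevant + n_filler)


if __name__ == "__main__":
    for n in (100, 1_000, 10_000, 50_000):
        share = relevant_attention_share(n)
        print(f"{n:>6} filler tokens -> attention share {share:.4f}")
```

The filler here carries no meaning at all, which is the analogue of the masked-token condition: the share falls purely because there are more positions to spread attention over.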
Factor 2: Distraction Penalty (15-37 percentage points additional drop). This stacks on top of the length tax. Du et al.’s essay-distractor condition performed significantly worse than the masked-token condition at every context length tested. The difference — 15 to 37 percentage points depending on the task — is pure distraction cost. Irrelevant meaningful content is worse than irrelevant meaningless content.
This matches Chroma’s “Context Rot” research (2025), which found a counter-intuitive result: shuffled haystacks outperform coherent ones. When the distractor passages are scrambled and incoherent, they’re less distracting than well-formed prose that the model tries to integrate. The mechanism makes sense once you think about it — semantically coherent distractors compete for attention more effectively than gibberish. Chroma also found that needle-question similarity accelerates the decline. The closer a distractor is to the actual query topic, the worse it gets.
Liu et al. (2023, “Lost in the Middle”) documented the positional dimension of this: even when relevant information is present, its position in context determines whether the model uses it. The U-shaped retrieval curve means middle-positioned information gets systematically underweighted. Distractors don’t just add noise — they push relevant content into worse positions.
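The way Factors 1 and 2 stack can be written as simple arithmetic. The sketch below assumes you have three accuracy measurements for the same task, in percentage points: a short-context baseline, a long context padded with masked tokens, and a long context padded with coherent distractors. The function name and the example numbers are hypothetical and are not taken from Du et al.

```python
def decompose_drop(baseline: float, masked: float, distractor: float) -> dict:
    """Split a total accuracy drop into length tax and distraction penalty.

    All arguments are accuracies in percentage points on the same task.
    """
    length_tax = baseline - masked             # cost of length alone
    distraction_penalty = masked - distractor  # extra cost of meaningful filler
    return {
        "length_tax": length_tax,
        "distraction_penalty": distraction_penalty,
        "total_drop": length_tax + distraction_penalty,
    }


if __name__ == "__main__":
    # Hypothetical numbers for illustration only.
    print(decompose_drop(baseline=90.0, masked=70.0, distractor=45.0))
```

Only the second term is something context engineering can shrink; the first is what remains even after perfect filtering.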
Factor 3: Organization Penalty (5-40% performance swing). How you format and structure context matters. He et al. (2024, “Does Prompt Formatting Have Any Impact on LLM Performance?”) showed that formatting changes alone (Markdown headers, XML tags, structured layouts) swing performance by up to 40% on smaller models and 5% or more on frontier models. Anthropic’s research on long-context prompting (2024, “Prompt engineering for long context”) found that adding scratchpads and examples reduced errors by 36%.
Here’s where I have to hedge explicitly: the Factor 3 numbers come from short-context studies. Nobody has run a structured-vs-unstructured comparison at 50K+ tokens. The interaction between organization and length is unmeasured. My intuition — based on running hundreds of long sessions — is that structure matters more at long context, not less. But I haven’t proven it, and neither has anyone else.
Why the Decomposition Matters
If you treat context degradation as a single phenomenon — “models get worse with more tokens” — your only lever is keeping context short. That’s a blunt instrument. The three-factor view gives you a sharper diagnosis.
The length tax is the only factor you cannot reduce. It’s the cost of using a transformer at scale. Accept it as a floor.
The distraction penalty is the factor with the most room for improvement. Every irrelevant document, stale conversation turn, and abandoned reasoning path that remains in context is actively degrading performance — not just by taking up space, but by competing for the model’s attention. Filtering it out doesn’t just save tokens. It removes semantic interference.
The organization penalty is the cheapest factor to address. Restructuring your context costs almost nothing at inference time but can recover significant performance. XML section markers, YAML structured data, clear headers — these aren’t cosmetic. They’re load-bearing.
What This Looks Like in Practice
I run two systems where this plays out daily.
cc-remote is my overnight agent system. A task queue with 50+ entries fires sequentially, each task spawning a multi-turn Claude session. Early versions accumulated everything — full conversation history, all intermediate results, verbose error logs. Sessions that started strong would degrade predictably around the 35-40K token mark. The fix wasn’t reducing the number of tasks. It was aggressive context pruning between tasks: strip completed task context down to a structured result summary, discard intermediate reasoning, keep only what the next task needs.
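A minimal sketch of that between-task pruning step, assuming a hypothetical `TaskRecord` shape; the field names and the one-line summary format are illustrative and are not cc-remote's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class TaskRecord:
    """Everything a finished task left behind (hypothetical shape)."""
    task_id: str
    status: str
    result_summary: str
    reasoning: list[str] = field(default_factory=list)  # intermediate turns
    error_log: list[str] = field(default_factory=list)  # verbose logs


def prune_completed(record: TaskRecord) -> dict[str, str]:
    """Keep only the structured result; drop reasoning and logs."""
    return {
        "task_id": record.task_id,
        "status": record.status,
        "result": record.result_summary,
    }


def build_next_context(history: list[TaskRecord], next_task_prompt: str) -> str:
    """Carry forward pruned summaries, followed by the next task's prompt."""
    lines = []
    for record in history:
        pruned = prune_completed(record)
        lines.append(f"[{pruned['task_id']}] {pruned['status']}: {pruned['result']}")
    lines.append(next_task_prompt)
    return "\n".join(lines)
```

The design choice is that pruning is lossy on purpose: anything the next task might need has to be promoted into `result_summary` before the task closes, because the reasoning and logs never reach the next context.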
cc-live-brief accumulates session briefs from 150+ turn interactive sessions. I noticed that sessions running with clean, structured context — what I call SSDM format (structured session data with YAML frontmatter and XML section markers) — produce measurably better handoff briefs than sessions with messy, unstructured context at the same token count. Same model, same prompts, same task complexity. The difference is what’s in the context and how it’s organized.
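To make "YAML frontmatter and XML section markers" concrete, here is a rough sketch of a renderer using only the standard library. The function name, the metadata keys, and the section tags are assumptions for illustration, not the actual SSDM specification. It also assumes metadata values are single-line strings and that section names are already valid XML tag names.

```python
from xml.sax.saxutils import escape


def render_brief(meta: dict[str, str], sections: dict[str, str]) -> str:
    """Render flat metadata as YAML frontmatter, then XML-tagged sections."""
    front = ["---"]
    for key, value in meta.items():
        # Double-quote every value so ':' or '#' in the text stays valid YAML.
        quoted = value.replace("\\", "\\\\").replace('"', '\\"')
        front.append(f'{key}: "{quoted}"')
    front.append("---")

    body = []
    for tag, text in sections.items():
        # escape() handles &, < and > so section text cannot break the markup.
        body.append(f"<{tag}>\n{escape(text)}\n</{tag}>")

    return "\n".join(front) + "\n\n" + "\n\n".join(body)


if __name__ == "__main__":
    print(render_brief(
        {"session": "demo", "status": "in progress"},
        {"current_task": "Compare a < b cases", "open_questions": "None yet"},
    ))
```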
The practical moves in my system:
- Aggressive pruning: completed task results get compressed to structured summaries. Intermediate reasoning is discarded. This attacks Factor 2 directly.
- YAML structured briefs: session state is captured in typed, structured format rather than prose narrative. This attacks Factor 3.
- XML section markers: context is explicitly sectioned so the model can navigate it. Headers aren’t for humans reading the prompt. They’re for the attention mechanism.
- Semantic filtering: when context approaches the budget, I don’t just truncate oldest-first. I remove content that’s semantically distant from the current task. Chroma’s finding that a distractor’s similarity to the query changes how much damage it does means the filtering criterion matters.
The Experiment Nobody Has Run
The obvious next study: take a fixed set of tasks, hold total token count constant, and vary (a) distractor content vs. clean padding, (b) structured vs. unstructured formatting, and (c) their interaction — all at 50K+ tokens. A 2x2 factorial at long context.
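For concreteness, the design can be laid out as a condition grid plus the interaction contrast. The condition labels, the 50,000-token constant, and the function names are assumptions chosen for illustration; no such experiment has been run, so the sketch contains no results.

```python
from itertools import product

PADDING = ("masked_tokens", "coherent_distractors")  # factor (a)
FORMAT = ("unstructured", "structured")              # factor (b)
TOTAL_TOKENS = 50_000                                # held constant across cells


def conditions() -> list[dict]:
    """Enumerate the four cells of the 2x2 design."""
    return [
        {"padding": p, "format": f, "total_tokens": TOTAL_TOKENS}
        for p, f in product(PADDING, FORMAT)
    ]


def interaction(acc: dict[tuple[str, str], float]) -> float:
    """Difference-in-differences on accuracy.

    Gain from structure when distractors are present, minus the gain
    from structure under clean padding. Positive means structure helps
    more in noisy context.
    """
    with_distractors = (
        acc[("coherent_distractors", "structured")]
        - acc[("coherent_distractors", "unstructured")]
    )
    with_clean = (
        acc[("masked_tokens", "structured")]
        - acc[("masked_tokens", "unstructured")]
    )
    return with_distractors - with_clean
```

A value near zero from `interaction` would mean the two factors are simply additive, which is the null result this design is built to detect.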
Du et al. did (a) beautifully. He et al. did (b) at short context. Nobody has done both together at the scale where it matters for production agents. The interaction term is the interesting part. Does structure help more when distractors are present? Does filtering matter less when context is well-organized? I don’t know. I suspect the effects are super-additive — that structured, clean context at 50K tokens performs dramatically better than unstructured, noisy context at the same length. But suspicion isn’t evidence.
Takeaway
The next time your agent’s output degrades at 40K tokens, don’t blame the model. Check what’s in those 40K tokens. Two of the three degradation factors are under your control. The length tax is real and irreducible — but the distraction penalty and the organization penalty are engineering problems with engineering solutions.
References
- Du, Y., Tian, M., Ronanki, S., Rongali, S., Bodapati, S., Galstyan, A., Wells, A., Schwartz, R., Huerta, E. A., & Peng, H. (2025). “Context Length Alone Hurts LLM Performance Despite Perfect Retrieval.” arXiv:2510.05381
- Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2023). “Lost in the Middle: How Language Models Use Long Contexts.” arXiv:2307.03172
- Chroma. (2025). “Context Rot.” trychroma.com/research/context-rot
- He, J., Rungta, M., Koleczek, D., Sekhon, A., Wang, F. X., & Hasan, S. (2024). “Does Prompt Formatting Have Any Impact on LLM Performance?” arXiv:2411.10541
- Anthropic. (2024). “Prompt engineering for long context.” anthropic.com/news/prompting-long-context
- Wang, W., Min, J., & Zou, W. (2026). “Intelligence Degradation in Long-Context LLMs: Critical Threshold Determination via Natural Length Distribution Analysis.” arXiv:2601.15300