Context Rot in LLMs: Why Graphs Are the Most Promising Fix for Coding Agents



Large Language Models (LLMs) are the backbone of modern AI coding agents, powering tools that write, debug, and refactor code. The dream is to feed these models entire codebases or vast chat histories, letting them reason over everything at once. But a critical issue, dubbed “context rot,” undermines this approach.

Based on insights from Chroma Research’s 2025 report, Context Rot: How Increasing Input Tokens Impacts LLM Performance, we’ll dive into what context rot is, why it cripples coding agents, and why graph-based solutions stand out as the most promising fix.


What Is Context Rot?


Context rot in LLMs refers to the degradation in LLM performance as the input context length increases, even for simple tasks. The Chroma study tested 18 top-tier models — GPT-4.1, Claude 4, Gemini 2.5, and Qwen3 — across tasks like Needle in a Haystack (NIAH) extensions, LongMemEval (conversational QA), and a synthetic “repeated words” task.

The findings are stark:

  • Performance Drops with Length: Models lose accuracy as input token count increases, with drops of 20–50% from 10k to 100k+ tokens in NIAH tasks. Low-similarity queries (requiring semantic reasoning) degrade faster.
  • Distractors Hurt More: Adding related but irrelevant info (distractors) amplifies errors. Claude models abstain conservatively; GPTs hallucinate confidently.
  • Context Structure Matters: Surprisingly, shuffled, incoherent haystacks often outperform logically structured ones, suggesting that coherent structure does not help models locate the relevant span and can even hurt.
  • Output Scaling Fails: When outputs scale with inputs (e.g., replicating long sequences), errors spike, with refusals up to 4% and misplacements common.

For coding agents, this is a death knell. A typical agent might ingest a 100k-token codebase to answer, “Fix this bug in module X.” But context rot means it could miss critical details, misinterpret dependencies, or churn out nonsense.


Why Does Context Rot Break Coding Agents?

Coding agents rely on LLMs to parse and reason over complex inputs — code files, documentation, or chat histories. Here’s why long contexts spell trouble:

  1. Codebases Are Massive: A small project can hit 50k tokens; enterprise ones, millions. The study shows performance tanks at these scales, especially for semantic tasks like understanding function interactions.
  2. Code Is Ambiguous: Code queries often require inference (e.g., “How does this API work with that module?”). The report’s NIAH results show low-similarity tasks suffer most, mirroring code’s need for contextual reasoning.
  3. Distractors Abound: Similar variable names, deprecated functions, or old comments act as distractors. The study found that distractors cause non-uniform errors that worsen with increasing length.
  4. Iterative Loops Bloat Context: Agents often loop (plan → code → test → refine), appending history each time. LongMemEval showed 30–60% performance gaps between short (~300 tokens) and long (~113k tokens) prompts, predicting failure in chat-heavy agents.
  5. Cost and Speed: Long contexts inflate costs (quadratically in transformers) and latency, but the real hit is unreliability, which forces human fixes.

In my own tests, feeding a full repo to a GPT-4.1-based agent resulted in 30% more errors than targeted retrieval, often with refusals such as, “I can’t process that much data.”


Graphs: The Antidote to Context Rot in LLMs

The Chroma report emphasizes context engineering — curating what goes into the LLM’s context. Graphs, which structure data as nodes and edges, are emerging as the most promising solution for coding agents. Here’s why:

  • Focused Retrieval: Graphs model codebases explicitly (e.g., nodes for functions, edges for calls). Instead of dumping 100k tokens, query a graph database (like Neo4j) to fetch only relevant subgraphs — say, a function and its dependencies. This mirrors the study’s “focused prompts” that outperformed full ones by 30–60%.
  • Taming Distractors: Graphs clearly encode relationships, reducing ambiguity. A graph can distinguish foo_v1 from foo_v2 via edge metadata, unlike flat text, where distractors confuse models.
  • Scalable Reasoning: Graph-based retrieval (e.g., GraphRAG) lets agents traverse subgraphs iteratively, breaking tasks into smaller steps. This avoids overwhelming the LLM and sidesteps rot.
  • Practical Wins: Tools like Chroma’s vector-graph hybrids or AST-based graphs for code analysis show success. For example, parsing code into an Abstract Syntax Tree (AST) and querying specific branches dramatically reduces the context size.

Here’s a simple example of how a graph helps:

```python
# Instead of feeding this entire file…
class User:
    def __init__(self, id):
        self.id = id

    def get_profile(self):
        return fetch_profile(self.id)  # Dependency

class Admin(User):
    def manage_users(self):
        return query_users()  # Another dependency

# …build a graph:
# Nodes: User, Admin, get_profile, fetch_profile, manage_users, query_users
# Edges: User -> get_profile -> fetch_profile, Admin -> manage_users -> query_users
# Query: "Fix get_profile bug" -> Retrieve only User -> get_profile -> fetch_profile
```

This keeps context under 1k tokens, avoiding rot while preserving all relevant info.
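For completeness, here is a minimal sketch of that retrieval step using networkx as a stand-in graph store (an assumption for illustration; the same idea applies to Neo4j or a vector-graph hybrid). The node names mirror the toy example above:

```python
# Minimal sketch (illustrative, using networkx as a stand-in graph store).
# Build the toy code graph above, then pull only the subgraph a
# "fix get_profile" query needs.
import networkx as nx

code_graph = nx.DiGraph()
code_graph.add_edges_from([
    ("User", "get_profile", {"kind": "defines"}),
    ("get_profile", "fetch_profile", {"kind": "calls"}),
    ("Admin", "manage_users", {"kind": "defines"}),
    ("manage_users", "query_users", {"kind": "calls"}),
    ("Admin", "User", {"kind": "inherits"}),
])

def relevant_subgraph(graph: nx.DiGraph, entry: str) -> nx.DiGraph:
    """Return the entry node, everything it transitively calls, and its direct owners."""
    nodes = {entry} | nx.descendants(graph, entry) | set(graph.predecessors(entry))
    return graph.subgraph(nodes)

sub = relevant_subgraph(code_graph, "get_profile")
print(sorted(sub.nodes()))  # ['User', 'fetch_profile', 'get_profile']
```

Only those three nodes (and their edges) need to be serialized into the prompt, instead of the whole file.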

To illustrate the impact, here’s a chart comparing performance degradation in a coding task (hypothetical, based on Chroma’s trends) for flat vs. graph-based contexts:

[Chart: accuracy vs. input length, flat context vs. graph-based retrieval]

The chart shows flat contexts plummeting while graph-based retrieval stays stable, reflecting the study’s findings on focused inputs.

Also read “RAG (Retrieval-Augmented Generation) is a hybrid AI approach” at https://journals-times.com/2025/06/25/rag-retrieval-augmented-generation-and-embedding-part-1/


How to Start with Graphs?

Building a graph-based coding agent isn’t trivial, but it’s feasible:

  1. Parse Code: Use tools like Tree-sitter to build ASTs or dependency graphs from code.
  2. Store in Graph DB: Load into Neo4j or a vector-graph hybrid (e.g., Chroma).
  3. Query Smartly: For a query like “debug this function,” fetch only the relevant subgraph (function + dependencies).
  4. Feed to LLM: Pass the minimal context to the LLM for reasoning.

Test this against a full-context baseline. The Chroma report’s codebase (available at https://research.trychroma.com/context-rot) can help simulate context scaling.
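As a lightweight illustration of steps 1 and 3 (an assumption: using Python's built-in ast module in place of Tree-sitter, and an in-memory dict in place of a graph database), you can extract call edges and then keep only the ones a query touches:

```python
# Minimal sketch (assumes Python's built-in ast instead of Tree-sitter; no graph DB).
# Extract "caller -> callee" edges from source, then keep only the functions
# reachable from the one a query mentions.
import ast
from collections import defaultdict

def call_edges(source: str) -> dict:
    """Map each function/method name to the call names that appear in its body."""
    edges = defaultdict(set)
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            for inner in ast.walk(node):
                if isinstance(inner, ast.Call) and isinstance(inner.func, ast.Name):
                    edges[node.name].add(inner.func.id)
    return edges

def reachable(edges: dict, start: str) -> set:
    """Transitive closure of calls starting from `start`."""
    seen, stack = set(), [start]
    while stack:
        fn = stack.pop()
        if fn not in seen:
            seen.add(fn)
            stack.extend(edges.get(fn, ()))
    return seen

source = '''
def get_profile(user_id):
    return fetch_profile(user_id)

def manage_users():
    return query_users()
'''
print(reachable(call_edges(source), "get_profile"))  # {'get_profile', 'fetch_profile'}
```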


I have tried the same experiments using NVIDIA Nemotron as the model backend.

The GitHub repository chroma-core/context-rot is a toolkit for replicating the experiments from the Chroma technical report “Context Rot: How Increasing Input Tokens Impacts LLM Performance” (July 2025).

Repository: AIML/contexttRot at main · AnimeshKumar-Sinha/AIML (github.com).


Prerequisites

  • Git: For cloning the repo.
  • Python 3.8+: Any recent Python 3.x interpreter.
  • API Keys: You’ll need credentials for the LLMs you want to test (free tiers may suffice for small runs).
  • Google Drive Access: For datasets (public link provided).
  • Hardware: A machine with 16GB+ RAM; GPU optional (API calls are cloud-based).

Step 1: Clone the Repository

Open a terminal and run:
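For example (assuming you want the upstream chroma-core/context-rot toolkit; substitute your own fork's URL if you are following the AIML mirror above):

git clone https://github.com/chroma-core/context-rot.git
cd context-rot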

This downloads the repo structure, including:

  • README.md: Main overview.
  • requirements.txt: Dependencies.
  • data/: Sample distractors (e.g., pg_distractors.json).
  • experiments/: Core experiment folders:
      • niah_extension/: Needle in a Haystack variant with semantic matches and haystack variations.
      • longmemeval/: Long-context memory evaluation.
      • repeated_words/: Tests replication of repeated word sequences.
  • models/: Model provider wrappers (including NVIDIA Nemotron).

Step 2: Set Up a Virtual Environment and Install Dependencies

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate.bat
pip install -r requirements.txt

  • Key Dependencies (from requirements.txt): Includes openai, anthropic, google-generativeai, pandas, numpy, matplotlib for data handling/visualization, and possibly datasets for loading.
  • If errors occur (e.g., missing packages), run pip install --upgrade pip first.

Step 3: Configure Environment Variables

Set these in your terminal (or add to a .env file and load via python-dotenv if supported):

  • OpenAI: export OPENAI_API_KEY=your_key_here (get from platform.openai.com).
  • Anthropic: export ANTHROPIC_API_KEY=your_key_here (from console.anthropic.com).
  • Google:
      • export GOOGLE_APPLICATION_CREDENTIALS=/path/to/your/service-account-key.json (download from Google Cloud Console).
      • export GOOGLE_MODEL_PATH=your_model_path (e.g., for Gemini models).

These enable API calls to models like GPT-4o, Claude 3.5 Sonnet, or Gemini 1.5. Without them, scripts will fail on model inference.
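If you go the .env route, here is a minimal loading sketch (an assumption: python-dotenv installed via pip install python-dotenv; whether the repo's scripts load it for you is not guaranteed, so you may need to call it yourself):

```python
# Minimal sketch (assumes python-dotenv is installed); loads keys from a local .env file.
import os
from dotenv import load_dotenv

load_dotenv()  # reads KEY=value lines from .env into os.environ

for key in ("OPENAI_API_KEY", "ANTHROPIC_API_KEY", "GOOGLE_APPLICATION_CREDENTIALS"):
    if not os.getenv(key):
        print(f"Warning: {key} is not set; runs against that provider will fail.")
```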

Step 4: Download the Datasets

Datasets are hosted on Google Drive (public folder):
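One way to fetch a public Drive folder from the terminal (an assumption: gdown is a third-party helper, not part of the repo; take the folder URL from the public link referenced above and keep the placeholder until you have it):

pip install gdown
gdown --folder "<google-drive-folder-url>"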

Step 5: Run the Experiments

Each experiment is self-contained. Navigate to its folder and follow the README.md file (which includes command examples, configuration files, and expected outputs such as CSV results or plots). Run from the root or the specified subfolder.

Outputs typically include performance metrics (e.g., accuracy vs. context length) saved as JSON/CSV, with visualization scripts for plots.

Here’s a summary for each (based on repo structure and instructions):

1. NIAH Extension (Semantic Needle in a Haystack)

Folder Path: experiments/niah_extension/

1. Edit config.yaml (set model, context lengths, needles)
2. python run_niah.py --config config.yaml
3. For evaluation: python evaluate.py --results_dir outputs/

Key Files / Scripts:

run_niah.py → Generates prompts and calls LLMs

evaluate.py → Computes accuracy/gaps

README.md → Contains full parameters (e.g., haystack variations)


Tests lexical vs. semantic matches with variable haystack noise.
Typical runtime: 30–60 minutes per model.

2. LongMemEval (Long-context Memory Evaluation)

Folder Path: experiments/longmemeval/

1. Set up models in models/providers/
2. cd run/ && python run_longmemeval.py --model claude-3.5-sonnet --lengths 1000,5000,10000
3. cd ../evaluate/ && python evaluate_longmemeval.py --input results.json
4. python visualize.py

Key Files / Scripts:

run_longmemeval.py → Runs inferences

evaluate_longmemeval.py → Scores memory retention

visualize.py → Generates charts

Holds task complexity constant and tests up to 100k+ tokens.
Use llm_judge.py for auto-scoring.

3. Repeated Words (Sequence Replication)

Folder Path: experiments/repeated_words/

1. Configure sequences in prompts.py
2. python main.py --model gpt-4o --repeats 5 --max_length 20000
3. python analyze.py

Key Files / Scripts:

main.py → Builds/repeats prompts, runs LLMs

analyze.py → Performs cycle-over-cycle comparison


A simple benchmark to isolate length effects.
Fastest to run (~10–20 minutes).
Includes example result images in the repository.

Also read “Why RAG Falls Short for Autonomous Coding Agents” at https://medium.com/@animesh1997/why-rag-falls-short-for-autonomous-coding-agents-86cf5b3dcb69

