The C³ Context Cost Curve

A practical framework for context engineering in LLM systems

Most teams building with large language models share the same instinct: If the model makes mistakes, add more context.

Early on, this works. Accuracy improves. Hallucinations drop. Demos get better.

Then something strange happens. Latency spikes. Outputs become inconsistent. Debugging turns into guesswork. And despite all the extra context, the model performs worse.

This post introduces a practical framework for understanding why this happens, and provides concrete guidelines for context engineering based on conversations with practitioners building real-world LLM systems.

Context is not free

In most AI systems, context is treated as an unpriced resource. System prompts, conversation history, retrieved documents, user memory, and tool outputs are all stuffed into a single context window with the assumption that the model will sort it out.

In reality, context has real costs:

  • Token cost (paid on every request)
  • Latency (especially tail latency)
  • Failure surface area (hallucinations, contradictions, priority inversion)

The most reliable AI systems do not maximize context. They optimize it.

What counts as context?

Context is everything the model sees before it produces an output, including:

  • System instructions
  • Output schemas
  • Retrieved documents
  • User memory
  • Conversation history
  • Tool outputs
  • Agent scratchpads

The most common mistake teams make is treating all of this as interchangeable text. It is not.

Introducing C³: Cold, Warm, and Hot Context

The C³ framework classifies context based on two properties:

  1. How often it changes
  2. How authoritative it should be

Cold Context

Stable, rarely changing, highest authority

Examples include:

  • System prompts
  • Policies
  • Output schemas
  • Hard constraints

Cold context defines the rules of the system. If it is wrong, the failure can be catastrophic.

Best Practices:

  • Keep it minimal and surgical
  • Prefer schemas and validators over prose (sketched at the end of this section)
  • Avoid long explanations and too many examples

Anti-pattern:

Pasting an entire policy document into the system prompt.
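
To make "schemas and validators over prose" concrete, here is a minimal sketch assuming Pydantic v2; the TicketTriage model and its fields are hypothetical, stand-ins for whatever output contract your task actually needs:

    # The schema itself is cold context: compact, stable, and enforced by a
    # validator instead of paragraphs of prose instructions.
    from pydantic import BaseModel, Field, ValidationError


    class TicketTriage(BaseModel):
        # Hypothetical fields for a support-ticket triage task.
        category: str = Field(pattern="^(billing|bug|feature_request)$")
        severity: int = Field(ge=1, le=4)
        summary: str = Field(max_length=280)


    def parse_model_output(raw_json: str) -> TicketTriage | None:
        """Validate the model's JSON output against the cold-context schema."""
        try:
            return TicketTriage.model_validate_json(raw_json)
        except ValidationError:
            # Reject and retry (or escalate) instead of trusting malformed output.
            return None

A dozen lines of schema do enforcement work that pages of prose cannot.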

Warm Context

Slowly changing, moderately authoritative

Examples include:

  • User preferences
  • Long-term memory
  • Project or workspace metadata
  • Summarized conversation history

Warm context provides continuity, but it comes with risks:

  • It grows without bounds
  • It drifts over time
  • It is often stale or partially incorrect

Best Practices:

  • Retrieve it dynamically
  • Summarize aggressively
  • Label it as advisory rather than binding
  • Track freshness and confidence (illustrated below)

Anti-pattern:

Appending full conversation history indefinitely.
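
One way to make "track freshness and confidence" concrete is to store warm context as records with metadata and filter at retrieval time. A minimal sketch; the WarmFact record, thresholds, and advisory label are illustrative, not prescriptive:

    from dataclasses import dataclass
    from datetime import datetime, timedelta, timezone


    @dataclass
    class WarmFact:
        text: str             # e.g. a summarized preference or project detail
        updated_at: datetime  # timezone-aware timestamp of last confirmation
        confidence: float     # 0.0-1.0, however your pipeline estimates it


    def select_warm_context(facts: list[WarmFact],
                            max_age: timedelta = timedelta(days=30),
                            min_confidence: float = 0.6,
                            limit: int = 5) -> list[str]:
        """Keep only fresh, high-confidence facts, newest first, capped in count."""
        now = datetime.now(timezone.utc)
        eligible = [f for f in facts
                    if now - f.updated_at <= max_age and f.confidence >= min_confidence]
        eligible.sort(key=lambda f: f.updated_at, reverse=True)
        # Label the block as advisory so it never outranks cold instructions.
        return [f"[advisory memory] {f.text}" for f in eligible[:limit]]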

Hot Context

Ephemeral, volatile, lowest authority

Examples include:

  • Current user input
  • Tool outputs
  • Agent scratchpads
  • Intermediate reasoning artifacts

Hot context changes on every request and has the highest entropy. It is also the most likely to be wrong.

Best Practices:

  • Enforce hard token caps (sketched below)
  • Use structured tool outputs
  • Automatically expire data
  • Never blindly trust tool results

Anti-pattern:

Dumping every tool response back into the prompt.
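
A sketch of what a hard cap on tool output can look like. The approx_tokens helper is a rough character-count stand-in for a real tokenizer, and the budget is an arbitrary illustrative number:

    import json


    def approx_tokens(text: str) -> int:
        # Rough heuristic (~4 characters per token); swap in your provider's
        # real token counter for production budgets.
        return max(1, len(text) // 4)


    def cap_tool_output(tool_name: str, payload: dict, max_tokens: int = 500) -> str:
        """Serialize a tool result as structured JSON and enforce a hard token cap."""
        result = json.dumps(payload, ensure_ascii=False)
        if approx_tokens(result) > max_tokens:
            # Truncate explicitly instead of passing an unbounded blob downstream.
            result = result[: max_tokens * 4] + " ...[truncated]"
        return json.dumps({"tool": tool_name, "result": result})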

The Context Cost Curve

The key insight behind C³ is that context value grows sub-linearly, while context cost grows linearly or worse.

Early context additions dramatically improve accuracy. Past a certain point, returns diminish. Beyond that point, accuracy often declines.

This happens for several reasons:

  • Attention dilution
  • Conflicting signals
  • Implicit priority inversion
  • Larger hallucination surface area

Different context types reach this inflection point at different context sizes:

  • Cold context plateaus early
  • Warm context plateaus later
  • Hot context plateaus very early, often followed by a sharp decline

This relationship is the Context Cost Curve.
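
The shape is easy to see with a toy model: if value grows roughly logarithmically with tokens while cost grows linearly, net value rises, flattens, and then falls. The constants below are made up purely for illustration; real curves should come from your own evals:

    import math


    def net_value(tokens: int,
                  value_scale: float = 10.0,
                  cost_per_token: float = 0.002) -> float:
        # Toy model: diminishing-returns value minus linear cost.
        return value_scale * math.log1p(tokens) - cost_per_token * tokens


    if __name__ == "__main__":
        for t in (100, 1_000, 5_000, 20_000, 100_000):
            print(f"{t:>7} tokens -> net value {net_value(t):7.1f}")
        # Net value peaks and then declines as cost outpaces the log term.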

The C³ Context Stack

Not all context deserves equal weight. Reliable systems assemble context deliberately, from the highest authority to the lowest:

  1. Task instructions (cold)
  2. Output schemas and constraints (cold)
  3. Retrieved facts (warm)
  4. User or project memory (warm)
  5. Recent conversation (hot)
  6. Tool outputs and scratchpads (hot)

As you move down the stack:

  • Authority decreases
  • Volatility increases

A simple rule follows: lower layers should never override higher layers. Most hallucinations are violations of this rule.
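
One way to enforce that rule is to assemble the prompt top-down, spending the token budget in authority order so that when space runs out, it is the hot layers that get dropped. A minimal sketch; the layer names, tiers, and budget are illustrative:

    from dataclasses import dataclass


    @dataclass
    class ContextLayer:
        name: str  # e.g. "task_instructions", "retrieved_facts", "tool_outputs"
        tier: str  # "cold" | "warm" | "hot"
        text: str


    def approx_tokens(text: str) -> int:
        return max(1, len(text) // 4)  # rough stand-in for a real tokenizer


    def assemble_prompt(layers: list[ContextLayer], budget: int = 8_000) -> str:
        """Fill the window from highest authority down; never evict a higher layer."""
        parts, used = [], 0
        for layer in layers:  # caller passes layers ordered cold -> warm -> hot
            cost = approx_tokens(layer.text)
            if used + cost > budget:
                continue  # drop a lower-authority layer rather than displace a higher one
            parts.append(f"### {layer.name} ({layer.tier})\n{layer.text}")
            used += cost
        return "\n\n".join(parts)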

Operationalizing C³

Teams that take context seriously do a few simple things consistently:

  • Tag context blocks as cold, warm, or hot
  • Log context composition on every request (a sketch follows this list)
  • Track accuracy relative to context size
  • Enforce explicit token budgets
  • Continuously summarize or decay warm context
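
The first two items can be as simple as tagging each block and emitting one log record per request, so accuracy can later be plotted against context size. A sketch with hypothetical field names:

    import json
    import time
    from collections import Counter


    def approx_tokens(text: str) -> int:
        return max(1, len(text) // 4)  # rough stand-in for a real tokenizer


    def log_context_composition(request_id: str,
                                blocks: list[tuple[str, str]]) -> dict:
        """blocks is a list of (tier, text) pairs, tier in {"cold", "warm", "hot"}."""
        tokens_by_tier: Counter = Counter()
        for tier, text in blocks:
            tokens_by_tier[tier] += approx_tokens(text)
        record = {
            "request_id": request_id,
            "ts": time.time(),
            "tokens_by_tier": dict(tokens_by_tier),
            "total_tokens": sum(tokens_by_tier.values()),
        }
        # Emit to whatever logging or metrics pipeline you already use.
        print(json.dumps(record))
        return record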

In practice, most reliability issues in LLM systems do not come from model choice or prompt wording. They come from treating context as an unlimited buffer rather than a constrained design surface.

Once teams become explicit about what information deserves to be present, how long it should live, and how much authority it carries, much of the complexity disappears. Context stops being a source of surprises and becomes something you can reason about, measure, and improve.

If you are building an AI product and finding that reliability, latency, or evaluation quality degrades as you scale, I would welcome the chance to compare notes. I spend a lot of time working with teams navigating these exact problems and am always interested in learning from what others are seeing in the wild. Please reach out!