Context Engineering: Beyond Prompt Engineering
Prompt engineering is dead. The real leverage is in what you feed the model before the prompt—system instructions, retrieved documents, conversation history, tool outputs. Here's how to think about context as a first-class engineering problem.
Everyone talks about prompt engineering. Few talk about what actually matters: context engineering.
The prompt is the question you ask. The context is everything that shapes how the model understands that question. And in production systems, the context window—not the prompt—is where the real engineering happens.
The Context Window as a Scarce Resource
Modern LLMs offer generous context windows: 128k tokens for GPT-4 Turbo, 200k for Claude. But treating context as unlimited is a mistake.
Here’s why:
- Attention degrades with length. The “lost in the middle” phenomenon is real—information in the middle of long contexts gets less attention.
- Cost scales linearly. Every token costs money. A 100k context costs 10x more than a 10k context.
- Latency increases. Longer contexts mean slower responses. Users notice.
Context engineering is the discipline of maximizing signal and minimizing noise within your token budget.
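To make the cost point concrete, here is a back-of-the-envelope estimator. The per-token price below is a placeholder, not a quote for any particular model or provider:

# Placeholder pricing: $3 per million input tokens (check your provider's actual rates).
PRICE_PER_INPUT_TOKEN = 3.00 / 1_000_000

def estimated_input_cost(context_tokens: int, requests_per_day: int) -> float:
    """Daily input cost scales linearly with context size."""
    return context_tokens * PRICE_PER_INPUT_TOKEN * requests_per_day

# 100k-token contexts cost 10x what 10k-token contexts do, at any volume.
print(estimated_input_cost(10_000, 50_000))   # ~$1,500/day
print(estimated_input_cost(100_000, 50_000))  # ~$15,000/day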
The Anatomy of Context
Every LLM call has multiple context layers:
┌─────────────────────────────────────────┐
│ System Instructions                     │
│ (Role, capabilities, constraints)       │
├─────────────────────────────────────────┤
│ Retrieved Documents                     │
│ (RAG results, relevant context)         │
├─────────────────────────────────────────┤
│ Conversation History                    │
│ (Prior turns, user preferences)         │
├─────────────────────────────────────────┤
│ Tool Outputs                            │
│ (Function results, API responses)       │
├─────────────────────────────────────────┤
│ User Query                              │
│ (The actual question/request)           │
└─────────────────────────────────────────┘
Each layer serves a purpose. Each layer competes for tokens. Engineering context means making intentional decisions about what goes where—and what gets cut.
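Here is a sketch of what that layering looks like in code, with each layer assembled explicitly rather than concatenated ad hoc. The field names are illustrative, not a prescribed schema:

from dataclasses import dataclass

@dataclass
class ContextLayers:
    system_instructions: str
    retrieved_documents: list[str]
    conversation_history: list[str]
    tool_outputs: list[str]
    user_query: str

    def assemble(self) -> str:
        """Concatenate layers in a fixed order so every call is built the same way."""
        sections = [
            self.system_instructions,
            "\n\n".join(self.retrieved_documents),
            "\n".join(self.conversation_history),
            "\n".join(self.tool_outputs),
            self.user_query,
        ]
        return "\n\n---\n\n".join(s for s in sections if s)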
System Instructions: Setting the Stage
Your system prompt isn’t a suggestion box. It’s a contract.
SYSTEM_PROMPT = """You are a customer support agent for Acme Corp.
## Core Behaviors
- Answer questions about Acme products and policies
- Never provide medical, legal, or financial advice
- Escalate to human support when uncertain
## Response Format
- Use bullet points for multi-step instructions
- Include relevant product links when available
- Keep responses under 200 words unless complexity requires more
## Knowledge Boundaries
- You know Acme's public product catalog and policies
- You do NOT have access to customer account information
- You cannot process returns or refunds directly
## Examples
User: "How do I reset my password?"
Response: [Show step-by-step process]
User: "What's my order status?"
Response: "I don't have access to your account. Please visit [link] or contact support at [email]."
"""
Notice the structure:
- Core behaviors define what the agent does
- Response format constrains how it responds
- Knowledge boundaries prevent hallucination
- Examples demonstrate expected behavior
This isn’t a prompt—it’s a specification.
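A specification should also fit its budget. Here is a minimal sketch of a startup guard that counts the system prompt's tokens, assuming OpenAI's tiktoken tokenizer and the 500-1000 token allocation from the budget framework later in this post; it also doubles as one possible implementation of the count_tokens helper used throughout:

import tiktoken

# Assumes tiktoken's cl100k_base encoding; swap in your model's tokenizer.
_ENCODING = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    return len(_ENCODING.encode(text))

# Fail fast at startup if the specification outgrows its fixed allocation.
SYSTEM_PROMPT_BUDGET = 1000
assert count_tokens(SYSTEM_PROMPT) <= SYSTEM_PROMPT_BUDGET, (
    f"System prompt is {count_tokens(SYSTEM_PROMPT)} tokens; budget is {SYSTEM_PROMPT_BUDGET}"
)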
Retrieved Context: Quality Over Quantity
The temptation with RAG is to stuff as much retrieved content as possible into the context. More information must mean better answers, right?
Wrong.
from datetime import datetime, timedelta

# count_tokens, summarize, Document, and AUTHORITATIVE_SOURCES are assumed helpers.
def prepare_context(
    query: str,
    documents: list[Document],
    max_tokens: int = 4000
) -> str:
    # Sort by relevance (reranker already did this),
    # but also consider recency and source authority
    weighted_docs = []
    for doc in documents:
        weight = doc.relevance_score
        # Boost recent documents
        if doc.updated_at > datetime.now() - timedelta(days=30):
            weight *= 1.2
        # Boost authoritative sources
        if doc.source in AUTHORITATIVE_SOURCES:
            weight *= 1.3
        weighted_docs.append((doc, weight))

    # Take top docs within token budget
    selected = []
    current_tokens = 0
    for doc, weight in sorted(weighted_docs, key=lambda x: -x[1]):
        doc_tokens = count_tokens(doc.content)
        if current_tokens + doc_tokens > max_tokens:
            # Doesn't fit: try to include a summary instead
            summary = summarize(doc.content, max_tokens=200)
            if current_tokens + count_tokens(summary) <= max_tokens:
                selected.append(f"[Summary] {summary}")
                current_tokens += count_tokens(summary)
            continue  # never append the full document once the budget is exceeded
        selected.append(doc.content)
        current_tokens += doc_tokens

    return "\n\n---\n\n".join(selected)
Key principles:
- Relevance first, but not only. Recency and authority matter too.
- Hard token budgets. Never exceed your allocation.
- Graceful degradation. Summaries are better than nothing.
Conversation History: Memory Management
Long conversations accumulate context. Left unchecked, you’ll hit token limits—or worse, pay for irrelevant history.
class ConversationManager:
    def __init__(self, max_turns: int = 10, max_tokens: int = 2000):
        self.max_turns = max_turns
        self.max_tokens = max_tokens

    def prepare_history(
        self,
        messages: list[Message]
    ) -> list[Message]:
        # Always keep the system message, even when older turns are dropped
        system = messages[0] if messages[0].role == "system" else None

        # Keep recent messages
        recent = messages[-self.max_turns:]

        # Summarize the older messages that fall just outside the window
        older_start = max(0, len(messages) - self.max_turns - 10)
        older = messages[older_start:len(messages) - self.max_turns]
        if older:
            summary = self.summarize_conversation(older)
            recent = [Message(role="system", content=f"Previous context: {summary}")] + recent

        # Re-pin the system message if it fell outside the recent window
        if system is not None and system not in recent:
            recent = [system] + recent

        # Ensure we're within token budget
        while self.count_tokens(recent) > self.max_tokens and len(recent) > 2:
            # Drop the oldest message after the pinned context
            recent = [recent[0]] + recent[2:]

        return recent
The art is in deciding what to remember. User preferences? Keep them. A debugging tangent from five turns ago? Summarize or drop it.
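One way to make that call concrete is to pin messages that state durable preferences and let everything else age out. This is a sketch using the same Message objects as above; the is_preference keyword heuristic and its marker list are hypothetical and would need tuning:

# Hypothetical heuristic: pin messages that look like durable preferences.
PREFERENCE_MARKERS = ("always", "never", "i prefer", "call me", "my timezone")

def is_preference(message: Message) -> bool:
    text = message.content.lower()
    return message.role == "user" and any(marker in text for marker in PREFERENCE_MARKERS)

def split_history(messages: list[Message]) -> tuple[list[Message], list[Message]]:
    """Separate messages to pin verbatim from messages that can be summarized or dropped."""
    pinned = [m for m in messages if is_preference(m)]
    disposable = [m for m in messages if not is_preference(m)]
    return pinned, disposable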
Tool Outputs: Structured Context
When your LLM calls tools, the outputs become context for the next generation. Structure matters.
import json
from typing import Any

class ToolOutputFormatter:
    def format(self, tool_name: str, output: Any) -> str:
        if tool_name == "database_query":
            return self.format_query_results(output)
        elif tool_name == "api_call":
            return self.format_api_response(output)
        elif tool_name == "calculation":
            return self.format_calculation(output)
        else:
            return json.dumps(output, indent=2)

    def format_query_results(self, results: list[dict]) -> str:
        if not results:
            return "No results found."
        if len(results) > 10:
            # Summarize large result sets
            return f"""Query returned {len(results)} results.

Top 5 results:
{self.format_table(results[:5])}

Summary statistics:
{self.compute_summary(results)}"""
        return self.format_table(results)
Never dump raw JSON into your context. Structure it for the model. Summarize when appropriate. The model doesn’t need 500 rows—it needs the insight.
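The format_table and compute_summary helpers referenced above aren't shown. Here is one minimal way they could look, written as standalone functions for brevity (they would live as methods on the formatter) and assuming flat dict rows where summary statistics only make sense for numeric columns:

from statistics import mean

def format_table(rows: list[dict]) -> str:
    """Render rows as a compact pipe-delimited table the model can scan."""
    if not rows:
        return "(no rows)"
    columns = list(rows[0].keys())
    lines = [" | ".join(columns)]
    for row in rows:
        lines.append(" | ".join(str(row.get(col, "")) for col in columns))
    return "\n".join(lines)

def compute_summary(rows: list[dict]) -> str:
    """Report row count plus min/mean/max for numeric columns only."""
    numeric = {
        col: [row[col] for row in rows if isinstance(row.get(col), (int, float))]
        for col in rows[0].keys()
    }
    parts = [f"{len(rows)} rows"]
    for col, values in numeric.items():
        if values:
            parts.append(f"{col}: min={min(values)}, mean={mean(values):.2f}, max={max(values)}")
    return "; ".join(parts)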
The Context Budget Framework
I use a simple framework to allocate context tokens:
| Layer | Allocation | Priority |
|---|---|---|
| System Instructions | 500-1000 tokens | Fixed |
| Retrieved Documents | 2000-4000 tokens | Variable |
| Conversation History | 1000-2000 tokens | Variable |
| Tool Outputs | 500-1000 tokens | Variable |
| User Query | 100-500 tokens | Fixed |
| Response Buffer | 1000-2000 tokens | Fixed |
Total budget: 8k-12k tokens for most use cases.
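As a sketch, the framework can be written down as a plain dict with a sanity check that the total stays inside the target window. The numbers mirror the midpoints of the table above:

# Midpoint allocations from the table above (tokens).
CONTEXT_BUDGET = {
    "system_instructions": 750,
    "retrieved_documents": 3000,
    "conversation_history": 1500,
    "tool_outputs": 750,
    "user_query": 300,
    "response_buffer": 1500,
}

TOTAL_BUDGET = sum(CONTEXT_BUDGET.values())  # 7,800 tokens
assert TOTAL_BUDGET <= 12_000, "Context budget exceeds the target window"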
The key insight: you don’t need 128k tokens. You need the right 8k tokens.
Practical Patterns
Pattern 1: Context Compression
When you have more context than budget, compress intelligently:
def compress_context(content: str, target_tokens: int) -> str:
    current_tokens = count_tokens(content)
    if current_tokens <= target_tokens:
        return content

    # Try extractive summarization first
    summary = extractive_summarize(content, ratio=target_tokens / current_tokens)
    if count_tokens(summary) <= target_tokens:
        return summary

    # Fall back to aggressive truncation with ellipsis
    return truncate_with_context(content, target_tokens)
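truncate_with_context is left undefined above. One plausible implementation keeps the head and tail of the text and marks the cut, on the theory that the opening and closing of a document usually carry the most signal. A sketch, with a rough words-per-token heuristic as the stated assumption:

def truncate_with_context(content: str, target_tokens: int) -> str:
    """Keep the beginning and end of the text, eliding the middle."""
    words = content.split()
    # Rough heuristic: assume ~0.75 words per token to stay under budget.
    keep = int(target_tokens * 0.75)
    if len(words) <= keep:
        return content
    head = words[: keep // 2]
    tail = words[-(keep - keep // 2):]
    return " ".join(head) + "\n[... truncated ...]\n" + " ".join(tail)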
Pattern 2: Dynamic Allocation
Different queries need different context mixes:
def allocate_context(query_type: str) -> dict:
    if query_type == "factual":
        return {
            "retrieved_docs": 4000,
            "conversation": 500,
            "tool_outputs": 0
        }
    elif query_type == "conversational":
        return {
            "retrieved_docs": 1000,
            "conversation": 3000,
            "tool_outputs": 500
        }
    elif query_type == "analytical":
        return {
            "retrieved_docs": 2000,
            "conversation": 1000,
            "tool_outputs": 2000
        }
    # Fall back to a balanced split for unrecognized query types
    return {
        "retrieved_docs": 2000,
        "conversation": 1500,
        "tool_outputs": 1000
    }
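How query_type gets determined is out of scope above. A cheap keyword heuristic is often enough to route the allocation, with an LLM classifier as a fallback for ambiguous cases. The marker lists below are hypothetical placeholders:

# Hypothetical keyword heuristics for routing; tune or replace with an LLM classifier.
ANALYTICAL_MARKERS = ("compare", "trend", "calculate", "how many", "average")
CONVERSATIONAL_MARKERS = ("you said", "earlier", "as we discussed", "remind me")

def classify_query(query: str) -> str:
    text = query.lower()
    if any(marker in text for marker in ANALYTICAL_MARKERS):
        return "analytical"
    if any(marker in text for marker in CONVERSATIONAL_MARKERS):
        return "conversational"
    return "factual"

budget = allocate_context(classify_query("How many orders shipped late last month?"))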
Pattern 3: Context Caching
Expensive context preparation? Cache it.
from functools import lru_cache

# get_user_preferences and get_interaction_summary are assumed helpers.
@lru_cache(maxsize=1000)
def prepare_static_context(user_id: str) -> str:
    """Cache user-specific context that doesn't change often."""
    preferences = get_user_preferences(user_id)
    history_summary = get_interaction_summary(user_id)
    return f"""User Context:
- Preferences: {preferences}
- History: {history_summary}"""
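One caveat: lru_cache never expires entries, so a preference change won't surface until the process restarts. A minimal sketch of explicit invalidation, assuming a hypothetical hook that fires whenever preferences are written:

def on_preferences_updated(user_id: str) -> None:
    # lru_cache has no per-key eviction, so clear the whole cache on writes.
    # Fine for infrequent updates; reach for a TTL cache if updates are common.
    prepare_static_context.cache_clear()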
The Future of Context
As context windows grow, the temptation to “just use more context” will increase. Resist it.
The models that win in production aren’t the ones with the longest context windows—they’re the ones with the most thoughtfully curated context. Quality beats quantity. Always.
Context engineering isn’t about maximizing what you put in. It’s about maximizing what comes out.
Building AI systems that need robust context management? Let’s talk about production architectures.