Context Engineering: Beyond Prompt Engineering
Prompt engineering is dead. The real leverage is in what you feed the model before the prompt—system instructions, retrieved documents, conversation history, tool outputs. Here's how to think about context as a first-class engineering problem.
Everyone talks about prompt engineering. Few talk about what actually matters: context engineering.
The prompt is the question you ask. The context is everything that shapes how the model understands that question. And in production systems, the context window—not the prompt—is where the real engineering happens.
The Context Window as a Scarce Resource
Modern LLMs offer generous context windows: 128k tokens for GPT-4 Turbo, 200k for Claude. But treating context as unlimited is a mistake.
Here’s why:
- Attention degrades with length. The “lost in the middle” phenomenon is real—information in the middle of long contexts gets less attention.
- Cost scales linearly. Every token costs money. A 100k context costs 10x more than a 10k context.
- Latency increases. Longer contexts mean slower responses. Users notice.
Context engineering is the discipline of maximizing signal and minimizing noise within your token budget.
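To make the cost point concrete, here is a back-of-the-envelope estimator. The per-token price below is a placeholder, not a quote for any particular model or provider:

# Placeholder pricing: $3 per million input tokens (check your provider's actual rates).
PRICE_PER_INPUT_TOKEN = 3.00 / 1_000_000

def estimated_input_cost(context_tokens: int, requests_per_day: int) -> float:
    """Daily input cost scales linearly with context size."""
    return context_tokens * PRICE_PER_INPUT_TOKEN * requests_per_day

# 100k-token contexts cost 10x what 10k-token contexts do, at any volume.
print(estimated_input_cost(10_000, 50_000))   # ~$1,500/day
print(estimated_input_cost(100_000, 50_000))  # ~$15,000/day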
The Anatomy of Context
Every LLM call has multiple context layers:
┌─────────────────────────────────────────┐
│ System Instructions                     │
│ (Role, capabilities, constraints)       │
├─────────────────────────────────────────┤
│ Retrieved Documents                     │
│ (RAG results, relevant context)         │
├─────────────────────────────────────────┤
│ Conversation History                    │
│ (Prior turns, user preferences)         │
├─────────────────────────────────────────┤
│ Tool Outputs                            │
│ (Function results, API responses)       │
├─────────────────────────────────────────┤
│ User Query                              │
│ (The actual question/request)           │
└─────────────────────────────────────────┘
Each layer serves a purpose. Each layer competes for tokens. Engineering context means making intentional decisions about what goes where—and what gets cut.
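Here is a sketch of what that layering looks like in code, with each layer assembled explicitly rather than concatenated ad hoc. The field names are illustrative, not a prescribed schema:

from dataclasses import dataclass

@dataclass
class ContextLayers:
    system_instructions: str
    retrieved_documents: list[str]
    conversation_history: list[str]
    tool_outputs: list[str]
    user_query: str

    def assemble(self) -> str:
        """Concatenate layers in a fixed order so every call is built the same way."""
        sections = [
            self.system_instructions,
            "\n\n".join(self.retrieved_documents),
            "\n".join(self.conversation_history),
            "\n".join(self.tool_outputs),
            self.user_query,
        ]
        return "\n\n---\n\n".join(s for s in sections if s)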
System Instructions: Setting the Stage
Your system prompt isn’t a suggestion box. It’s a contract.
SYSTEM_PROMPT = """You are a customer support agent for Acme Corp.
## Core Behaviors
- Answer questions about Acme products and policies
- Never provide medical, legal, or financial advice
- Escalate to human support when uncertain
## Response Format
- Use bullet points for multi-step instructions
- Include relevant product links when available
- Keep responses under 200 words unless complexity requires more
## Knowledge Boundaries
- You know Acme's public product catalog and policies
- You do NOT have access to customer account information
- You cannot process returns or refunds directly
## Examples
User: "How do I reset my password?"
Response: [Show step-by-step process]
User: "What's my order status?"
Response: "I don't have access to your account. Please visit [link] or contact support at [email]."
"""
Notice the structure:
- Core behaviors define what the agent does
- Response format constrains how it responds
- Knowledge boundaries prevent hallucination
- Examples demonstrate expected behavior
This isn’t a prompt—it’s a specification.
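A specification should also fit its budget. Here is a minimal sketch of a startup guard that counts the system prompt's tokens, assuming OpenAI's tiktoken tokenizer and the 500-1000 token allocation from the budget framework later in this post; it also doubles as one possible implementation of the count_tokens helper used throughout:

import tiktoken

# Assumes tiktoken's cl100k_base encoding; swap in your model's tokenizer.
_ENCODING = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    return len(_ENCODING.encode(text))

# Fail fast at startup if the specification outgrows its fixed allocation.
SYSTEM_PROMPT_BUDGET = 1000
assert count_tokens(SYSTEM_PROMPT) <= SYSTEM_PROMPT_BUDGET, (
    f"System prompt is {count_tokens(SYSTEM_PROMPT)} tokens; budget is {SYSTEM_PROMPT_BUDGET}"
)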
Retrieved Context: Quality Over Quantity
The temptation with RAG is to stuff as much retrieved content as possible into the context. More information must mean better answers, right?
Wrong.
from datetime import datetime, timedelta

# count_tokens, summarize, Document, and AUTHORITATIVE_SOURCES are assumed helpers.
def prepare_context(
    query: str,
    documents: list[Document],
    max_tokens: int = 4000
) -> str:
    # Sort by relevance (reranker already did this),
    # but also consider recency and source authority
    weighted_docs = []
    for doc in documents:
        weight = doc.relevance_score
        # Boost recent documents
        if doc.updated_at > datetime.now() - timedelta(days=30):
            weight *= 1.2
        # Boost authoritative sources
        if doc.source in AUTHORITATIVE_SOURCES:
            weight *= 1.3
        weighted_docs.append((doc, weight))

    # Take top docs within token budget
    selected = []
    current_tokens = 0
    for doc, weight in sorted(weighted_docs, key=lambda x: -x[1]):
        doc_tokens = count_tokens(doc.content)
        if current_tokens + doc_tokens > max_tokens:
            # Doesn't fit: try to include a summary instead
            summary = summarize(doc.content, max_tokens=200)
            if current_tokens + count_tokens(summary) <= max_tokens:
                selected.append(f"[Summary] {summary}")
                current_tokens += count_tokens(summary)
            continue  # never append the full document once the budget is exceeded
        selected.append(doc.content)
        current_tokens += doc_tokens

    return "\n\n---\n\n".join(selected)
Key principles:
- Relevance first, but not only. Recency and authority matter too.
- Hard token budgets. Never exceed your allocation.
- Graceful degradation. Summaries are better than nothing.
Conversation History: Memory Management
Long conversations accumulate context. Left unchecked, you’ll hit token limits—or worse, pay for irrelevant history.
class ConversationManager:
    def __init__(self, max_turns: int = 10, max_tokens: int = 2000):
        self.max_turns = max_turns
        self.max_tokens = max_tokens

    def prepare_history(
        self,
        messages: list[Message]
    ) -> list[Message]:
        # Always keep the system message, even when older turns are dropped
        system = messages[0] if messages[0].role == "system" else None

        # Keep recent messages
        recent = messages[-self.max_turns:]

        # Summarize the older messages that fall just outside the window
        older_start = max(0, len(messages) - self.max_turns - 10)
        older = messages[older_start:len(messages) - self.max_turns]
        if older:
            summary = self.summarize_conversation(older)
            recent = [Message(role="system", content=f"Previous context: {summary}")] + recent

        # Re-pin the system message if it fell outside the recent window
        if system is not None and system not in recent:
            recent = [system] + recent

        # Ensure we're within token budget
        while self.count_tokens(recent) > self.max_tokens and len(recent) > 2:
            # Drop the oldest message after the pinned context
            recent = [recent[0]] + recent[2:]

        return recent
The art is in deciding what to remember. User preferences? Keep them. A debugging tangent from five turns ago? Summarize or drop it.
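One way to make that call concrete is to pin messages that state durable preferences and let everything else age out. This is a sketch using the same Message objects as above; the is_preference keyword heuristic and its marker list are hypothetical and would need tuning:

# Hypothetical heuristic: pin messages that look like durable preferences.
PREFERENCE_MARKERS = ("always", "never", "i prefer", "call me", "my timezone")

def is_preference(message: Message) -> bool:
    text = message.content.lower()
    return message.role == "user" and any(marker in text for marker in PREFERENCE_MARKERS)

def split_history(messages: list[Message]) -> tuple[list[Message], list[Message]]:
    """Separate messages to pin verbatim from messages that can be summarized or dropped."""
    pinned = [m for m in messages if is_preference(m)]
    disposable = [m for m in messages if not is_preference(m)]
    return pinned, disposable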
Tool Outputs: Structured Context
When your LLM calls tools, the outputs become context for the next generation. Structure matters.
import json
from typing import Any

class ToolOutputFormatter:
    def format(self, tool_name: str, output: Any) -> str:
        if tool_name == "database_query":
            return self.format_query_results(output)
        elif tool_name == "api_call":
            return self.format_api_response(output)
        elif tool_name == "calculation":
            return self.format_calculation(output)
        else:
            return json.dumps(output, indent=2)

    def format_query_results(self, results: list[dict]) -> str:
        if not results:
            return "No results found."
        if len(results) > 10:
            # Summarize large result sets
            return f"""Query returned {len(results)} results.

Top 5 results:
{self.format_table(results[:5])}

Summary statistics:
{self.compute_summary(results)}"""
        return self.format_table(results)
Never dump raw JSON into your context. Structure it for the model. Summarize when appropriate. The model doesn’t need 500 rows—it needs the insight.
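The format_table and compute_summary helpers referenced above aren't shown. Here is one minimal way they could look, written as standalone functions for brevity (they would live as methods on the formatter) and assuming flat dict rows where summary statistics only make sense for numeric columns:

from statistics import mean

def format_table(rows: list[dict]) -> str:
    """Render rows as a compact pipe-delimited table the model can scan."""
    if not rows:
        return "(no rows)"
    columns = list(rows[0].keys())
    lines = [" | ".join(columns)]
    for row in rows:
        lines.append(" | ".join(str(row.get(col, "")) for col in columns))
    return "\n".join(lines)

def compute_summary(rows: list[dict]) -> str:
    """Report row count plus min/mean/max for numeric columns only."""
    numeric = {
        col: [row[col] for row in rows if isinstance(row.get(col), (int, float))]
        for col in rows[0].keys()
    }
    parts = [f"{len(rows)} rows"]
    for col, values in numeric.items():
        if values:
            parts.append(f"{col}: min={min(values)}, mean={mean(values):.2f}, max={max(values)}")
    return "; ".join(parts)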
The Context Budget Framework
I use a simple framework to allocate context tokens:
| Layer | Allocation | Priority |
|---|---|---|
| System Instructions | 500-1000 tokens | Fixed |
| Retrieved Documents | 2000-4000 tokens | Variable |
| Conversation History | 1000-2000 tokens | Variable |
| Tool Outputs | 500-1000 tokens | Variable |
| User Query | 100-500 tokens | Fixed |
| Response Buffer | 1000-2000 tokens | Fixed |
Total budget: 8k-12k tokens for most use cases.
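As a sketch, the framework can be written down as a plain dict with a sanity check that the total stays inside the target window. The numbers mirror the midpoints of the table above:

# Midpoint allocations from the table above (tokens).
CONTEXT_BUDGET = {
    "system_instructions": 750,
    "retrieved_documents": 3000,
    "conversation_history": 1500,
    "tool_outputs": 750,
    "user_query": 300,
    "response_buffer": 1500,
}

TOTAL_BUDGET = sum(CONTEXT_BUDGET.values())  # 7,800 tokens
assert TOTAL_BUDGET <= 12_000, "Context budget exceeds the target window"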
The key insight: you don’t need 128k tokens. You need the right 8k tokens.
Practical Patterns
Pattern 1: Context Compression
When you have more context than budget, compress intelligently:
def compress_context(content: str, target_tokens: int) -> str:
    current_tokens = count_tokens(content)
    if current_tokens <= target_tokens:
        return content

    # Try extractive summarization first
    summary = extractive_summarize(content, ratio=target_tokens / current_tokens)
    if count_tokens(summary) <= target_tokens:
        return summary

    # Fall back to aggressive truncation with ellipsis
    return truncate_with_context(content, target_tokens)
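truncate_with_context is left undefined above. One plausible implementation keeps the head and tail of the text and marks the cut, on the theory that the opening and closing of a document usually carry the most signal. A sketch, with a rough words-per-token heuristic as the stated assumption:

def truncate_with_context(content: str, target_tokens: int) -> str:
    """Keep the beginning and end of the text, eliding the middle."""
    words = content.split()
    # Rough heuristic: assume ~0.75 words per token to stay under budget.
    keep = int(target_tokens * 0.75)
    if len(words) <= keep:
        return content
    head = words[: keep // 2]
    tail = words[-(keep - keep // 2):]
    return " ".join(head) + "\n[... truncated ...]\n" + " ".join(tail)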
Pattern 2: Dynamic Allocation
Different queries need different context mixes:
def allocate_context(query_type: str) -> dict:
    if query_type == "factual":
        return {
            "retrieved_docs": 4000,
            "conversation": 500,
            "tool_outputs": 0
        }
    elif query_type == "conversational":
        return {
            "retrieved_docs": 1000,
            "conversation": 3000,
            "tool_outputs": 500
        }
    elif query_type == "analytical":
        return {
            "retrieved_docs": 2000,
            "conversation": 1000,
            "tool_outputs": 2000
        }
    # Fall back to a balanced split for unrecognized query types
    return {
        "retrieved_docs": 2000,
        "conversation": 1500,
        "tool_outputs": 1000
    }
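How query_type gets determined is out of scope above. A cheap keyword heuristic is often enough to route the allocation, with an LLM classifier as a fallback for ambiguous cases. The marker lists below are hypothetical placeholders:

# Hypothetical keyword heuristics for routing; tune or replace with an LLM classifier.
ANALYTICAL_MARKERS = ("compare", "trend", "calculate", "how many", "average")
CONVERSATIONAL_MARKERS = ("you said", "earlier", "as we discussed", "remind me")

def classify_query(query: str) -> str:
    text = query.lower()
    if any(marker in text for marker in ANALYTICAL_MARKERS):
        return "analytical"
    if any(marker in text for marker in CONVERSATIONAL_MARKERS):
        return "conversational"
    return "factual"

budget = allocate_context(classify_query("How many orders shipped late last month?"))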
Pattern 3: Context Caching
Expensive context preparation? Cache it.
from functools import lru_cache

# get_user_preferences and get_interaction_summary are assumed helpers.
@lru_cache(maxsize=1000)
def prepare_static_context(user_id: str) -> str:
    """Cache user-specific context that doesn't change often."""
    preferences = get_user_preferences(user_id)
    history_summary = get_interaction_summary(user_id)
    return f"""User Context:
- Preferences: {preferences}
- History: {history_summary}"""
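One caveat: lru_cache never expires entries, so a preference change won't surface until the process restarts. A minimal sketch of explicit invalidation, assuming a hypothetical hook that fires whenever preferences are written:

def on_preferences_updated(user_id: str) -> None:
    # lru_cache has no per-key eviction, so clear the whole cache on writes.
    # Fine for infrequent updates; reach for a TTL cache if updates are common.
    prepare_static_context.cache_clear()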
The Future of Context
As context windows grow, the temptation to “just use more context” will increase. Resist it.
The models that win in production aren’t the ones with the longest context windows—they’re the ones with the most thoughtfully curated context. Quality beats quantity. Always.
Context engineering isn’t about maximizing what you put in. It’s about maximizing what comes out.
Building AI systems that need robust context management? Let’s talk about production architectures.