200,000 tokens sounds enormous until you realize that your system prompt, CLAUDE.md files, MCP tool descriptions, conversation history, and extended thinking are all competing for the same window. Managing context is about deciding what stays, what gets compressed, and what you pay for every turn.
Effort Levels
The --effort flag controls how deeply Claude reasons before responding. Lower effort means fewer thinking tokens and faster answers, but also less thorough analysis for complex tasks.
| Level | Behavior | Best For | Token Impact |
|---|---|---|---|
| low | Minimal reasoning, fastest responses | Simple lookups, yes/no questions, classification | Near-zero thinking tokens |
| medium | Balanced reasoning | Moderate complexity, code formatting, summaries | Modest thinking budget |
| high | Full reasoning (default) | Most development tasks, code generation, debugging | 10K+ thinking tokens possible |
| max | Maximum reasoning depth | Complex architecture, multi-file refactoring, subtle bugs | 20K+ thinking tokens possible |
You can set effort via the CLI flag, an environment variable, or in your CLAUDE.md:
```shell
# CLI flag
claude -p "Refactor auth.py" --effort medium

# Environment variable
export CLAUDE_EFFORT=medium
```

Or set it in your CLAUDE.md:

```markdown
# Settings
effort: medium
```

For cost-sensitive CI pipelines, combine effort with MAX_THINKING_TOKENS to get full reasoning within a hard ceiling:

```shell
MAX_THINKING_TOKENS=8000 claude -p "Refactor auth.py" --effort high
```

This gives high-quality reasoning but prevents runaway thinking-token spend. At Opus pricing ($75/M output tokens), an uncapped complex prompt can burn 20K+ thinking tokens: $1.50 in a single response just for thinking.
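That worst-case figure is simple arithmetic, easy to verify with the $75/M output-token rate quoted above:

```shell
# Worst case for an uncapped response: 20,000 thinking tokens
# billed as output tokens at $75 per million
awk 'BEGIN { printf "$%.2f\n", 20000 * 75 / 1000000 }'
# prints $1.50
```

The same one-liner works for any model's output rate; substitute your token count and price per million to budget a thinking-token cap before a CI run.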
Context Consumers
Every piece of information in Claude’s context window costs tokens. Use the /context command inside an interactive session to see a live breakdown.
What Fills the Context Window
| Consumer | Typical Size | Notes |
|---|---|---|
| System prompt | 5K-10K tokens | Always present, auto-cached |
| CLAUDE.md files | 1K-10K tokens | All CLAUDE.md files in the directory hierarchy are loaded |
| MCP tool descriptions | 10K-50K+ tokens | The silent context killer — 20+ MCP servers can shrink effective context to ~150K |
| Conversation history | Grows per turn | Largest consumer in long sessions |
| Extended thinking | 10K+ per response | Output tokens, expensive at Opus rates ($75/M) |
| Tool results | Varies widely | Large file reads can consume 5K+ each |
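The MCP line in the table is worth sanity-checking. Assuming roughly 200-500 tokens per tool description (the range cited later in this section), schema overhead scales linearly with tool count:

```shell
# Back-of-envelope: schema overhead for 80 tools at 200-500 tokens each
awk 'BEGIN { printf "%d-%d tokens\n", 80 * 200, 80 * 500 }'
# prints 16000-40000 tokens
```

That upper bound is consumed before your first prompt, which is why auditing tool count matters as much as auditing conversation length.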
Where Do Your Tokens Go?
MCP tool descriptions deserve special attention. Each tool description consumes 200-500 tokens. With 80+ tools across multiple MCP servers, you can lose 40K-50K tokens before you even start working. Audit with /context, disable unused servers in .claude/settings.json, and use ToolSearch deferred loading to load tool schemas on demand rather than upfront:
```shell
# Trigger deferred loading when MCP tools exceed 5% of context
ENABLE_TOOL_SEARCH=auto:5 claude
```

The /compact Command
When your context is filling up, the /compact command summarizes the conversation so far and replaces the full history with a compressed version. The key advantage over waiting for auto-compaction is control: you tell Claude what to preserve.
You can also embed compaction instructions directly in your CLAUDE.md so they apply automatically every time compaction occurs, whether manual or automatic:
```markdown
# Compact instructions
When you are using compact, please focus on test output and code changes
```

Three ways to manage context bloat, each for a different situation:
/compact vs /clear vs --resume
| Command | What Happens | Use When |
|---|---|---|
| /compact | Summarizes conversation, keeps working state | Context filling up, still working on the same task |
| /compact [focus] | Summarizes with emphasis on specified topics | You know which details matter most |
| /clear | Wipes conversation entirely, re-reads CLAUDE.md | Starting a completely new task in same session |
| --resume | Loads a previous session from disk | Continuing work across terminal sessions |
Auto-Compaction
Auto-compaction triggers at approximately 80-90% context window utilization. When it fires, Claude summarizes the entire conversation, replaces the full history with that summary, re-reads CLAUDE.md from disk, and condenses tool results to their essentials. You do not control what gets prioritized — the model decides.
The problem with auto-compaction is that it is reactive. By the time it triggers, subtle early details may already be compressed beyond recovery. A design decision rationale from turn 3 might become a single sentence by turn 40. The proactive strategy is to run /compact manually at around 60% utilization with explicit focus instructions, so you decide what survives rather than leaving it to the automatic trigger.
For tasks that require reading many files or producing verbose output, consider delegating to a subagent using the Task tool. Subagents run in their own context windows, keeping the main conversation lean while still doing thorough analysis.
Cache Economics
Claude Code automatically caches system prompts and CLAUDE.md content. Understanding the two cache tiers is the difference between a $50 session and a $10 session.
Cache Tiers
| Cache Type | Duration | Cost Multiplier | Notes |
|---|---|---|---|
| Ephemeral cache write | 5 minutes | 1.25x base input price | Auto-expires, created on every call |
| Long-lived cache write | 1 hour | 2.0x base input price | For stable content like system prompts |
| Cache read | N/A | 0.1x base input price | 90% savings on repeated content |
The first call in a session writes the system prompt (~14K tokens) to cache at 1.25-2.0x the base rate. Every subsequent call within 5 minutes reads that cached content at 0.1x the base rate. After 5 minutes of inactivity, the ephemeral cache expires and the next call pays the full write cost again.
For a 100-turn Opus session, the savings are dramatic: without caching, 100 turns of 15K input tokens at $15/M costs roughly $22.50 in input tokens alone. With caching, 1 write plus 99 reads comes to about $2.51 — an 89% reduction on input tokens. Real-world total session costs drop from $50-100 to $10-19.
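The arithmetic behind those numbers can be reproduced directly, assuming the $15/M Opus base input price plus the 1.25x ephemeral write and 0.1x read multipliers from the tier table:

```shell
# 100-turn session, ~15K cacheable input tokens per turn
awk 'BEGIN {
  base = 15 / 1000000    # dollars per input token at $15/M
  turns = 100; tokens = 15000
  uncached = turns * tokens * base                                      # full price every turn
  cached = tokens * base * 1.25 + (turns - 1) * tokens * base * 0.10    # 1 write + 99 reads
  printf "uncached=$%.2f cached=$%.2f savings=%.0f%%\n", uncached, cached, (1 - cached / uncached) * 100
}'
# prints uncached=$22.50 cached=$2.51 savings=89%
```

The model assumes every turn within the 5-minute window hits the warm cache; any pause long enough to expire the ephemeral cache adds another full-price write.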
In one comparison, two identical prompts ("What is 2+2?") were run 30 seconds apart. The --effort low run cost 3.8x more than the --effort high run, but not because of effort level. The entire cost difference came from cache state: the first call had a cache miss (9104 cache-creation tokens), while the second hit a warm cache (14253 cache-read tokens). On subsequent runs with a warm cache, costs converge regardless of effort. For trivial prompts, effort level does not change the answer or the cost.
Auto-compaction fires at 80-90% context utilization and decides on its own what to keep. Subtle details from early in the conversation — design rationales, rejected approaches, edge case discussions — can be compressed to a single sentence or lost entirely. Run /compact proactively at ~60% utilization with explicit focus instructions to control what survives.
CLAUDE.md always survives compaction. When auto-compaction or /compact runs, CLAUDE.md is re-read from disk rather than being summarized. This means your project instructions, style guides, and custom compaction instructions persist across every compaction cycle. Put your most critical context there rather than relying on conversation history to preserve it.
The --effort low flag does not guarantee cheaper calls. As the payload comparison above shows, cache state has a far bigger impact on cost than effort level for simple prompts. Effort primarily affects reasoning depth and thinking tokens on complex tasks. To minimize cost, focus on cache hit rates: batch related calls together and avoid pausing more than 5 minutes between them.