Token Explosion in AI Agents: Why Your Costs Scale Exponentially

I built an AI agent from scratch. Not because frameworks aren't good. They are (and I suggest you use them). But because I needed to see where every token goes.
When you're building production systems that could cost $150K+/year in LLM tokens alone, you can't afford to treat token usage as an afterthought. Yet most teams do. They prototype with frameworks, scale to production, and then wonder why their AWS bill looks like a startup runway burn rate.
This is the story of what I found when I stripped away abstractions and measured token costs at the bare metal level. The numbers tell a story that most builders don't see until it's too late.
The Setup: Building an AI Agent from Scratch
I built a network device monitoring agent, the kind enterprises use for infrastructure observability. Think querying device metrics, analyzing performance trends, checking network topology, and troubleshooting connectivity issues.
Why this use case?
Real-world complexity (not a toy chatbot)
Tool diversity (CRUD operations, time-series analytics, graph queries)
Realistic conversation patterns (engineers troubleshooting issues in multi-turn conversations)
The stack:
Model: gpt-4o-mini (cost-conscious, production-grade)
Tools: 6 functions covering device metrics, historical data, topology, and paths
Data: Mock implementations of TimescaleDB (time-series) and Neo4j (graph) structures
Framework: None. Pure Python with OpenAI API.
Why no framework?
Frameworks like LangChain and LlamaIndex are production-ready and handle a lot of complexity. But they abstract away cost mechanics. When token usage becomes the dominant operating expense, you need visibility frameworks don't provide.
I wanted to measure:
How many tokens does each tool definition consume?
How does conversation depth impact costs?
What happens in multi-turn conversations?
Where exactly does the exponential growth come from?
The approach: Four phases, each isolating a different variable. No optimizations until measurement is complete. Pure observation.
Phase 1: The Baseline (Single Tool, Single Query)
Scenario: User asks: "Get me metrics for device DEV_ID_123"
Flow:
User query → LLM (with tool definitions)
LLM decides to call get_device_metrics(device_id="DEV_ID_123")
Tool executes, returns device data
Tool result → LLM
LLM synthesizes natural language answer
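The flow above can be sketched as a minimal two-call loop. The LLM call is stubbed out here (`call_llm` is a placeholder, not the real OpenAI client), because the point is the message structure: the tool definitions and the growing `messages` list are what drive token usage on every request.

```python
import json

# Stub standing in for the OpenAI chat-completions call. A real agent sends
# `messages` plus every tool definition on each request; the stub just mimics
# the two decisions described above.
def call_llm(messages, tools):
    if not any(m.get("role") == "tool" for m in messages):
        # Call 1: the model picks a tool and emits a tool call.
        return {"role": "assistant", "tool_calls": [{
            "id": "call_1",
            "function": {"name": "get_device_metrics",
                         "arguments": json.dumps({"device_id": "DEV_ID_123"})}}]}
    # Call 2: the model synthesizes an answer from the tool result.
    return {"role": "assistant", "content": "Device DEV_ID_123 is operational."}

def get_device_metrics(device_id):
    return {"device_id": device_id, "status": "operational"}  # mock tool

tools = ["<tool definitions, resent on every call>"]
messages = [
    {"role": "system", "content": "You are a network monitoring assistant."},
    {"role": "user", "content": "Get me metrics for device DEV_ID_123"},
]

decision = call_llm(messages, tools)                   # Call 1: tool selection
messages.append(decision)
call = decision["tool_calls"][0]
result = get_device_metrics(**json.loads(call["function"]["arguments"]))
messages.append({"role": "tool", "tool_call_id": call["id"],
                 "content": json.dumps(result)})
messages.append(call_llm(messages, tools))             # Call 2: synthesis
```

Note that by the end of one query, the history already holds five messages, and both API calls carried the tool definitions.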

Token breakdown:
Call 1 (LLM decision):
- System prompt: ~100 tokens
- Tool definition: ~140 tokens
- User query: ~20 tokens
- LLM response (tool call): ~19 tokens
Total: ~279 tokens
Call 2 (LLM synthesis):
- Previous messages: ~297 tokens
- Tool result: ~200 tokens (JSON)
- LLM response (answer): ~134 tokens
Total: ~311 tokens
Phase 1 Total: ~590 tokens
Tool definition structure (why 140 tokens):
```json
{
  "type": "function",
  "function": {
    "name": "get_device_metrics",
    "description": "Get detailed metrics and information for a specific network device by its device ID. Returns device name, type (router/switch/modem/core), location (city and area), operational status (operational/degraded/down), alias, and timestamp information.",
    "parameters": {
      "type": "object",
      "properties": {
        "device_id": {
          "type": "string",
          "description": "The unique device identifier (e.g., 'DEV_ID_123')"
        }
      },
      "required": ["device_id"]
    }
  }
}
```
Every word in that description and every parameter definition costs tokens. And this definition gets sent with EVERY query.
Baseline established: 590 tokens per query.
Phase 2: Tool Definition Scaling (1 Tool → 6 Tools)
What changed: Added 5 more tools:
get_device_metrics_timeseries - Historical CPU/memory/bandwidth data
get_devices_by_metric_threshold - Filter devices by performance metrics
get_device_uptime_history - Uptime/downtime events
get_device_neighbors - Network topology connections
get_devices_in_path - Path between two devices
Query: Same as Phase 1—"Get me metrics for device DEV_ID_123"
Key insight: The LLM still picks the correct tool (get_device_metrics). But now it has 6 tool definitions to process instead of 1.

Token breakdown:
Call 1 (LLM decision):
- System prompt: ~100 tokens
- Tool definitions (6 tools): ~840 tokens ← 6x increase
- User query: ~20 tokens
- LLM response (tool call): ~19 tokens
Total: ~979 tokens
Call 2 (LLM synthesis):
- Previous messages (no tools): ~225 tokens
- Tool result: ~200 tokens
- LLM response: ~176 tokens
Total: ~601 tokens
Phase 2 Total: ~1,204 tokens
Result: 2.04x increase (590 → 1,204 tokens)
The math:
1 tool = 140 tokens
6 tools = 840 tokens (+700 tokens, or +119%)
Linear scaling: 10 tools = 1,400 tokens, 100 tools = 14,000 tokens
At scale: If you're building an enterprise agent with 70-100 tools across domains (network, database, application, infrastructure), you're paying 14,000 tokens per query just for tool definitions.
Cost projection (100 tools, 1,000 queries/day):
14K tokens × 1,000 queries = 14M tokens/day
14M × 365 = 5.1B tokens/year
At $0.150 per 1M input tokens (gpt-4o-mini): $765/year just for tool definitions
And we haven't even executed a single tool yet.
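The projection above is simple arithmetic, and it is worth wiring into a helper so you can test scenarios before committing to a tool catalog. This sketch uses the article's figures (~140 tokens per definition, gpt-4o-mini input at $0.150 per 1M tokens); both are assumptions you should replace with your own measurements.

```python
# Article's figures; substitute your own measured values.
TOKENS_PER_TOOL = 140
INPUT_PRICE_PER_M = 0.150  # gpt-4o-mini, $ per 1M input tokens

def tool_definition_cost(num_tools, queries_per_day, days=365):
    """Annual token and dollar overhead of resending tool definitions."""
    tokens_per_query = num_tools * TOKENS_PER_TOOL
    total_tokens = tokens_per_query * queries_per_day * days
    dollars = total_tokens / 1e6 * INPUT_PRICE_PER_M
    return tokens_per_query, total_tokens, dollars

# 100 tools, 1,000 queries/day: 14,000 tokens/query, ~5.1B tokens/year.
per_query, total, dollars = tool_definition_cost(100, 1000)
```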
Phase 3: Conversation Depth (Multi-Tool Workflows)
Scenario: User asks: "Find devices with CPU above 70%, show their neighbors, and check paths from DEV_ID_123 to each high-CPU device"
This requires 3 sequential tool calls:
get_devices_by_metric_threshold - Find high-CPU devices
get_device_neighbors - Get neighbors for each device
get_devices_in_path - Check paths
The problem: Each iteration carries the full conversation history forward.

Iteration breakdown:
Iteration 1:
Messages sent to LLM:
[
{role: "system", content: "..."},
{role: "user", content: "Find devices with CPU > 70%..."}
]
+ 6 tool definitions
Tokens: ~900
LLM decides to call get_devices_by_metric_threshold.
Iteration 2:
Messages sent to LLM:
[
{role: "system", content: "..."},
{role: "user", content: "Find devices with CPU > 70%..."},
{role: "assistant", tool_calls: [...]}, ← LLM's decision
{role: "tool", content: "{...filtered devices...}"} ← Tool result (~200 tokens)
]
+ 6 tool definitions
Tokens: ~1,100
LLM decides to call get_device_neighbors.
Iteration 3:
Messages sent to LLM:
[
{role: "system", content: "..."},
{role: "user", content: "Find devices with CPU > 70%..."},
{role: "assistant", tool_calls: [...]},
{role: "tool", content: "{...filtered devices...}"},
{role: "assistant", tool_calls: [...]}, ← Previous iteration
{role: "tool", content: "{...neighbors data...}"} ← ~300 tokens
]
+ 6 tool definitions
Tokens: ~1,500
LLM decides to call get_devices_in_path.
Final synthesis call:
All previous messages + final tool result
Tokens: ~1,800
Phase 3 average: ~2,910 tokens (across multiple queries, averaging 2.2 iterations)
Result: 2.42x increase from Phase 2
Why this happens:
LLMs are stateless. They don't "remember" previous calls. The ONLY way they know what happened before is if you send the entire conversation history.
Each iteration isn't just "new query + new tool result." It's:
All previous user messages
All previous LLM decisions (tool calls)
All previous tool results
Plus the new stuff
The amplifier effect:
Some tools return large responses. Our get_device_metrics_timeseries returns 24 hours of CPU/memory/bandwidth data—about 400 tokens of JSON.
When that gets included in iteration 2, 3, 4... it's not just 400 tokens once. It's 400 tokens replayed in every subsequent LLM call.
Conversation structure after 3 iterations:
[
{"role": "system", "content": "..."}, # 100 tokens
# Iteration 1
{"role": "user", "content": "..."}, # 50 tokens
{"role": "assistant", "tool_calls": [...]}, # 30 tokens
{"role": "tool", "content": "{...}"}, # 200 tokens
# Iteration 2
{"role": "assistant", "tool_calls": [...]}, # 30 tokens
{"role": "tool", "content": "{...}"}, # 300 tokens
# Iteration 3
{"role": "assistant", "tool_calls": [...]}, # 30 tokens
{"role": "tool", "content": "{...}"}, # 250 tokens
# Final synthesis
{"role": "assistant", "content": "Based on the data..."} # 150 tokens
]
Total history: ~1,140 tokens (before tool definitions)
+ 6 tool definitions: ~840 tokens
= ~1,980 tokens just to maintain context
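The replay effect is easy to demonstrate with arithmetic. Using per-message token counts close to the breakdown above (illustrative, not exact), summing what is actually sent on each iteration shows how every tool result is billed again on every subsequent call:

```python
# Approximate per-message token counts, in the spirit of the breakdown above.
SYSTEM, TOOL_DEFS = 100, 840
history = [("user", 50)]
tool_results = [200, 300, 250]  # one tool result appended per iteration

sent_per_iteration = []
for result_tokens in tool_results:
    # Each request resends the system prompt, tool definitions, and ALL history.
    sent = SYSTEM + TOOL_DEFS + sum(t for _, t in history)
    sent_per_iteration.append(sent)
    history.append(("assistant", 30))        # tool-call decision
    history.append(("tool", result_tokens))  # tool result, replayed forever after

total_sent = sum(sent_per_iteration)
```

The per-iteration totals land near the ~900 / ~1,100 / ~1,500 figures measured above, and the sum across three iterations is already several times the cost of any single call.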
Phase 4: Multi-Turn Conversations (The Real Killer)
Scenario: Three-turn conversation with context references:
Turn 1: "Show me metrics for DEV_ID_123"
Turn 2: "What about its neighbors?" ← refers to DEV_ID_123
Turn 3: "Check uptime for those neighbors" ← refers to neighbors from Turn 2
The challenge: Turn 3 needs the full conversation history to understand "those neighbors."

Turn-by-turn breakdown:
Turn 1:
Messages:
[
{role: "system", content: "..."},
{role: "user", content: "Show me metrics for DEV_ID_123"},
{role: "assistant", tool_calls: [...]},
{role: "tool", content: "{...device data...}"},
{role: "assistant", content: "Device DEV_ID_123 is operational..."}
]
Tokens: ~1,591
Turn 2:
Messages:
[
{role: "system", content: "..."},
# Turn 1 history (all of it)
{role: "user", content: "Show me metrics for DEV_ID_123"},
{role: "assistant", tool_calls: [...]},
{role: "tool", content: "{...device data...}"},
{role: "assistant", content: "Device DEV_ID_123 is operational..."},
# Turn 2 (new)
{role: "user", content: "What about its neighbors?"},
{role: "assistant", tool_calls: [...]},
{role: "tool", content: "{...neighbors data...}"},
{role: "assistant", content: "DEV_ID_123 has 3 neighbors..."}
]
Tokens: ~2,379 (+50% from Turn 1)
Turn 3:
Messages:
[
{role: "system", content: "..."},
# Turn 1 history
{role: "user", content: "Show me metrics for DEV_ID_123"},
{role: "assistant", tool_calls: [...]},
{role: "tool", content: "{...device data...}"},
{role: "assistant", content: "Device DEV_ID_123 is operational..."},
# Turn 2 history
{role: "user", content: "What about its neighbors?"},
{role: "assistant", tool_calls: [...]},
{role: "tool", content: "{...neighbors data...}"},
{role: "assistant", content: "DEV_ID_123 has 3 neighbors..."},
# Turn 3 (new)
{role: "user", content: "Check uptime for those neighbors"},
{role: "assistant", tool_calls: [...]},
{role: "tool", content: "{...uptime data...}"},
{role: "assistant", content: "All three neighbors have 99%+ uptime..."}
]
Tokens: ~4,118 (+73% from Turn 2)
Phase 4 average: ~7,166 tokens per 3-turn conversation
Result: 2.46x increase from Phase 3
Growth pattern:
Turn 1: 1,591 tokens (baseline)
Turn 2: 2,379 tokens (+50%)
Turn 3: 4,118 tokens (+73%)
This is exponential, not linear.
Context dependency matters:
We tested 4 conversation patterns:
1. Linked context (pronouns: "its", "those")
   - Average: 8,088 tokens
   - Cannot truncate history without breaking references
2. Independent questions (no context overlap)
   - Average: 6,247 tokens
   - 80% of history is pure waste
3. Mixed pattern (partial dependencies)
   - Average: 7,164 tokens
   - Needs smart selective retention
4. Error recovery (corrections, retries)
   - Failed in testing (implementation gap)
The universal truth:
This isn't specific to my implementation. This is how ALL LLMs work:
ChatGPT
Claude
Gemini
Every LangChain/LlamaIndex app
LLMs are stateless. Conversation history is the ONLY way they "remember." Every production system sends the full conversation on every turn.
Why tool_calls AND tool_results must be sent:
You might think: "Can't we just send the assistant's final answers and skip the tool internals?"
No. The OpenAI API requires this structure:
[
{"role": "assistant", "tool_calls": [{"id": "call_abc123", ...}]},
{"role": "tool", "tool_call_id": "call_abc123", "content": "{...}"}
]
The tool_call_id must match. The LLM needs to see:
What tool it decided to call (reasoning chain)
What data came back (to reference in synthesis)
The full context (to make follow-up decisions)
You can't skip the tool internals without breaking the API contract.
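A quick structural check makes that contract concrete. If a compression scheme orphans a tool message (or strips the result that answers a tool_call), the request is rejected, so it pays to validate pairing before sending. A minimal checker, sketched here against plain message dicts:

```python
def validate_tool_pairing(messages):
    """Every assistant tool_call id must be answered by a tool message,
    and every tool message must reference an earlier tool_call."""
    pending = set()
    for msg in messages:
        if msg.get("tool_calls"):
            pending |= {tc["id"] for tc in msg["tool_calls"]}
        elif msg.get("role") == "tool":
            if msg["tool_call_id"] not in pending:
                return False  # orphaned tool result
            pending.discard(msg["tool_call_id"])
    return not pending  # unanswered tool calls also break the contract

history = [
    {"role": "assistant", "tool_calls": [{"id": "call_abc123"}]},
    {"role": "tool", "tool_call_id": "call_abc123", "content": "{}"},
]
validate_tool_pairing(history)       # True: well-formed pair
validate_tool_pairing(history[1:])   # False: tool result with no matching call
```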
Each turn in history includes:
User message (~20 tokens)
Assistant tool_call decision (~30 tokens)
Tool result (~200-400 tokens, depending on response size)
Assistant synthesis (~150 tokens)
Multiply by number of turns. That's your history cost.
The Complete Picture: From 590 to 7,166 Tokens

| Phase | Scenario | Tokens | Multiplier | Cost/Year* |
|---|---|---|---|---|
| Phase 1 | Single tool, single query | 590 | 1.0x | $32 |
| Phase 2 | 6 tools, single query | 1,204 | 2.0x | $66 |
| Phase 3 | 6 tools, multi-tool workflow | 2,910 | 4.9x | $159 |
| Phase 4 | 6 tools, 3-turn conversation | 7,166 | 12.1x | $392 |
*Assumes 1,000 queries/day, 365 days, gpt-4o-mini pricing
The exponential pattern:
Adding 5 tools: 2x cost
Adding 2 workflow iterations: 2.4x cost
Adding 2 conversation turns: 2.5x cost
Compound effect: 12.1x from baseline
Conversation depth costs more than tool quantity.
This isn't obvious until you measure it.
The Scaling Nightmare
Extrapolate to production scale:
Enterprise monitoring agent:
100 tools (network, database, application, infrastructure)
5-turn conversations (realistic troubleshooting session)
50 queries/user/day
100 power users
Token projection:
Tool definitions: 14,000 tokens
Conversation depth: 10,000 tokens (5 iterations avg)
History accumulation: 20,000+ tokens (5 turns)
Total per conversation: ~44,000 tokens
Daily usage: 100 users × 50 queries = 5,000 queries
Daily tokens: 5,000 × 44,000 = 220M tokens
Annual tokens: 220M × 365 = 80.3B tokens
Cost (gpt-4o-mini):
- Input: 80.3B × $0.150/1M = $12,045/year
- Output: 20B × $0.600/1M = $12,000/year
Total: $24,045/year minimum
Cost (gpt-4):
- Input: 80.3B × $2.50/1M = $200,750/year
- Output: 20B × $10/1M = $200,000/year
Total: $400,750/year
And this is JUST token costs. Not infrastructure, engineering, support, or training data.
At 1,000 users: $240K/year (gpt-4o-mini) or $4M/year (gpt-4).
Token management isn't a nice-to-have. It's a fundamental cost driver.
What Production Systems Do (And Their Trade-offs)
Every AI company faces this. Here's what they do:
1. Summarization (OpenAI, Anthropic)
Strategy: After N turns, replace old messages with a summary.
Example:
Turn 1-5: [full messages] - 10,000 tokens
Becomes: [summary] - 500 tokens
Trade-offs:
✅ Massive token savings (20x compression)
❌ Loses detail (can't reference specific data points)
❌ Summarization can hallucinate or miss nuance
❌ Adds latency (extra LLM call for summarization)
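A sketch of the pattern, with the summarization call stubbed out (a real implementation would make an extra LLM request here, which is exactly where the added latency comes from; `keep_last` and the summary format are arbitrary choices for illustration):

```python
def summarize(messages):
    # Stub: a real system would ask an LLM to compress these messages.
    return f"[Summary of {len(messages)} earlier messages]"

def compact_history(messages, keep_last=4):
    """Replace everything except the system prompt and the last few
    messages with a single summary message."""
    system, rest = messages[:1], messages[1:]
    if len(rest) <= keep_last:
        return messages  # nothing worth compacting yet
    old, recent = rest[:-keep_last], rest[-keep_last:]
    return system + [{"role": "system", "content": summarize(old)}] + recent
```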
2. Sliding Window (Common Pattern)
Strategy: Keep only last N turns, drop the rest.
Example:
Conversation with 10 turns
Keep: Turn 8, 9, 10
Drop: Turn 1-7
Trade-offs:
✅ Simple to implement
✅ Predictable token usage
❌ Can't reference old context ("Remember that device from Turn 3?")
❌ Breaks long troubleshooting sessions
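A minimal version keeps the system prompt plus the most recent messages. One subtlety worth showing: the window must not open on an orphaned tool result, or the request violates the tool_call pairing contract (a production version would also drop a trailing assistant tool_call whose answer fell outside the window):

```python
def sliding_window(messages, max_messages=6):
    """Keep the system prompt and the most recent messages, drop the rest."""
    system, rest = messages[:1], messages[1:]
    kept = rest[-max_messages:]
    # A tool message at the window edge has lost the assistant message
    # that issued its tool_call, so drop it too.
    while kept and kept[0].get("role") == "tool":
        kept = kept[1:]
    return system + kept
```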
3. Semantic Compression (Advanced)
Strategy: Analyze conversation, identify essential messages, drop irrelevant ones.
Example:
Turn 1: "Show device metrics" → Keep (context for Turn 2)
Turn 2: "What about neighbors?" → Keep (context for Turn 3)
Turn 3: "Show uptime" → Keep (most recent)
Turn 4: Independent query → Drop (not referenced later)
Trade-offs:
✅ Optimal token usage (keep only what's needed)
✅ Maintains coherence for linked context
❌ Complex logic (requires NLP analysis)
❌ Can make mistakes (drop something that's referenced later)
❌ Engineering overhead
4. RAG for Long Conversations (Enterprise)
Strategy: Store conversation in vector database, retrieve relevant snippets on demand.
Example:
Full conversation: 50 turns in vector DB
Current query: "What was that error from earlier?"
Retrieve: Turn 12, 13, 14 (error context)
Send to LLM: Only retrieved turns + current query
Trade-offs:
✅ Scales to very long conversations
✅ Semantic retrieval (finds relevant context)
❌ High engineering complexity
❌ Retrieval can miss context
❌ Adds latency (DB query + embedding)
5. Truncate Tool Results (Our Insight)
Strategy: Keep assistant responses (natural language), drop or compress tool_calls and tool_results.
Example:
Instead of:
{role: "tool", content: "{cpu: 78%, memory: 85%, bandwidth: 920mbps, ...400 tokens}"}
Send:
{role: "tool", content: "Summary: High CPU (78%), memory normal"}
Trade-offs:
✅ 3-5x reduction in history size
✅ Maintains conversational coherence (assistant answers kept)
❌ LLM can't reference raw data ("What was the exact CPU value?")
❌ Requires smart summarization logic
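A sketch of the idea: rewrite oversized tool payloads in past turns into short digests while leaving assistant answers untouched. The digest logic here (keep the first few JSON fields) is a deliberately naive placeholder; the article's point is that real summarization logic is the hard part.

```python
import json

def compress_tool_results(messages, max_chars=80):
    """Replace large tool payloads with short digests, keeping the
    assistant's natural-language answers intact."""
    compressed = []
    for msg in messages:
        if msg.get("role") == "tool" and len(msg.get("content", "")) > max_chars:
            data = json.loads(msg["content"])
            # Naive digest: first three fields. Real systems need smarter logic.
            digest = "Summary: " + ", ".join(
                f"{k}={v}" for k, v in list(data.items())[:3])
            msg = {**msg, "content": digest}
        compressed.append(msg)
    return compressed
```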
None of these are perfect. Everyone struggles with this.
The industry is actively researching better solutions. But for now, this is the reality.
What We're Testing Next
Phase 3: Execution Optimizations (Tactical)
Parallel tool execution
Execute independent tools concurrently
Reduces iterations (3 sequential calls → 1 parallel batch)
Target: 30-40% token reduction
Smart history truncation
Keep assistant responses, drop tool internals
Context-aware (keep turns with pronoun references)
Target: 3-5x reduction in history size
Tool result summarization
Compress large JSON responses (timeseries → summary stats)
Keep raw data in external store, reference by ID
Target: 2-3x reduction per large tool response
Phase 4: Tool Selection Optimization (Strategic)
The 10x win. This is where it gets interesting.
The problem: 100 tools × 140 tokens = 14,000 tokens per query.
The solution: Don't send all 100 tools. Send the top 5-10 most relevant.
Approaches we'll test:
Semantic routing (vector embeddings)
Embed tool descriptions in vector space
Embed user query
Retrieve top-K most similar tools
Send only those to LLM
Target: 14,000 → 1,400 tokens (10x)
Hierarchical tool organization
Category tools: "network", "database", "application"
LLM first picks category (1 LLM call)
Then picks specific tool from category (2nd LLM call)
Target: 14,000 → 2,000 tokens (7x)
Two-stage LLM (routing + execution)
Stage 1: Lightweight routing model picks tools (cheap)
Stage 2: Main model executes with only selected tools
Target: 14,000 → 1,500 tokens (9x)
Hypothesis: Tool selection optimization is more valuable than conversation compression.
We'll measure and share results.
Key Takeaways
For builders:
Measure before optimizing. You can't improve what you don't understand. Build visibility into your system from day 1.
Token costs are architectural, not incidental. Like database indexing or cache strategy, token management is a fundamental design concern.
Frameworks are great, but understand what they hide. LangChain and LlamaIndex solve real problems. But they abstract away cost mechanics. Know when to use them and when to build custom.
Conversation depth costs more than tool quantity. Adding 5 tools doubled costs. Adding 2 conversation turns tripled them. Multi-turn conversations are exponentially expensive.
For architects:
Budget for 3-5x token growth in production vs prototype. Your PoC that costs $50/month will cost $500-1,000/month at scale. Plan accordingly.
Context window limits are real. gpt-4o-mini has a 128K token context window. At our Phase 4 rate (2,696 tokens/turn), that's ~47 turns before you hit the limit. Then you MUST truncate or summarize.
LLMs are stateless everywhere. ChatGPT, Claude, Gemini—everyone faces this. Conversation history is the only way to maintain context. Design your system with this constraint in mind.
Tool selection > conversation compression (hypothesis to test). At 100 tools, reducing tool definitions from 14K → 1.4K saves more than aggressive history truncation.
For consultants:
This is a differentiator. Most teams don't measure token usage this deeply. They prototype, scale, and then panic when costs explode. Understanding token economics gives you a 5-10x cost advantage.
Cost optimization is strategic, not tactical. Picking gpt-4o-mini over gpt-4 is tactical (3x savings). Semantic tool routing is strategic (10x savings). Both matter, but strategic wins compound.
Token mechanics = AI economics. If you're advising clients on AI adoption, you need to understand this. Token costs are to AI what compute costs are to cloud infrastructure.
Conclusion
I started this investigation because I kept hearing: "LLM costs are manageable if you optimize prompts and pick the right model."
That's true for simple use cases. But for production AI agents with:
Dozens of tools
Multi-step workflows
Multi-turn conversations
Power users running hundreds of queries per day
...prompt optimization is noise. The signal is architectural.
Token costs don't scale linearly. They compound:
Tool definitions (linear)
Conversation depth (exponential)
History accumulation (exponential)
At enterprise scale, this becomes a $100K-$1M/year line item. That's not a rounding error. That's a strategic decision.
The good news: It's solvable. Semantic routing, smart truncation, parallel execution—these aren't exotic techniques. They're engineering problems with known solutions.
But you can't solve what you don't measure.
Build visibility. Measure religiously. Optimize strategically.
That's the difference between an AI prototype and an AI product.
About the author: I'm an independent technical consultant with 15 years of experience building production systems. Currently conducting systematic research into LLM optimization and token economics. Follow along as I share results from other phases of my token research.
Want to discuss token optimization strategies for your AI system? Drop a comment or reach out. I'm always interested in comparing notes with other builders tackling this problem.






