Token Explosion in AI Agents: Why Your Costs Scale Exponentially

I built an AI agent from scratch. Not because frameworks aren't good. They are (and I suggest you use them). But because I needed to see where every token goes.
When you're building production systems that could cost $150K+/year in LLM tokens alone, you can't afford to treat token usage as an afterthought. Yet most teams do. They prototype with frameworks, scale to production, and then wonder why their AWS bill looks like a startup runway burn rate.
This is the story of what I found when I stripped away abstractions and measured token costs at the bare metal level. The numbers tell a story that most builders don't see until it's too late.
The Setup: Building an AI Agent from Scratch
I built a network device monitoring agent, the kind enterprises use for infrastructure observability. Think querying device metrics, analyzing performance trends, checking network topology, and troubleshooting connectivity issues.
Why this use case?
Real-world complexity (not a toy chatbot)
Tool diversity (CRUD operations, time-series analytics, graph queries)
Realistic conversation patterns (engineers troubleshooting issues in multi-turn conversations)
The stack:
Model: gpt-4o-mini (cost-conscious, production-grade)
Tools: 6 functions covering device metrics, historical data, topology, and paths
Data: Mock implementations of TimescaleDB (time-series) and Neo4j (graph) structures
Framework: None. Pure Python with OpenAI API.
Why no framework?
Frameworks like LangChain and LlamaIndex are production-ready and handle a lot of complexity. But they abstract away cost mechanics. When token usage becomes the dominant operating expense, you need visibility frameworks don't provide.
I wanted to measure:
How many tokens does each tool definition consume?
How does conversation depth impact costs?
What happens in multi-turn conversations?
Where exactly does the exponential growth come from?
The approach: Four phases, each isolating a different variable. No optimizations until measurement is complete. Pure observation.
Phase 1: The Baseline (Single Tool, Single Query)
Scenario: User asks: "Get me metrics for device DEV_ID_123"
Flow:
User query → LLM (with tool definitions)
LLM decides to call get_device_metrics(device_id="DEV_ID_123")
Tool executes, returns device data
Tool result → LLM
LLM synthesizes natural language answer
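The flow above can be sketched as a minimal two-call loop. The LLM call is stubbed out here (`call_llm` is a placeholder, not the real OpenAI client), because the point is the message structure: the tool definitions and the growing `messages` list are what drive token usage on every request.

```python
import json

# Stub standing in for the OpenAI chat-completions call. A real agent sends
# `messages` plus every tool definition on each request; the stub just mimics
# the two decisions described above.
def call_llm(messages, tools):
    if not any(m.get("role") == "tool" for m in messages):
        # Call 1: the model picks a tool and emits a tool call.
        return {"role": "assistant", "tool_calls": [{
            "id": "call_1",
            "function": {"name": "get_device_metrics",
                         "arguments": json.dumps({"device_id": "DEV_ID_123"})}}]}
    # Call 2: the model synthesizes an answer from the tool result.
    return {"role": "assistant", "content": "Device DEV_ID_123 is operational."}

def get_device_metrics(device_id):
    return {"device_id": device_id, "status": "operational"}  # mock tool

tools = ["<tool definitions, resent on every call>"]
messages = [
    {"role": "system", "content": "You are a network monitoring assistant."},
    {"role": "user", "content": "Get me metrics for device DEV_ID_123"},
]

decision = call_llm(messages, tools)                   # Call 1: tool selection
messages.append(decision)
call = decision["tool_calls"][0]
result = get_device_metrics(**json.loads(call["function"]["arguments"]))
messages.append({"role": "tool", "tool_call_id": call["id"],
                 "content": json.dumps(result)})
messages.append(call_llm(messages, tools))             # Call 2: synthesis
```

Note that by the end of one query, the history already holds five messages, and both API calls carried the tool definitions.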

Token breakdown:
Call 1 (LLM decision):
- System prompt: ~100 tokens
- Tool definition: ~140 tokens
- User query: ~20 tokens
- LLM response (tool call): ~19 tokens
Total: ~279 tokens
Call 2 (LLM synthesis):
- Previous messages: ~297 tokens
- Tool result: ~200 tokens (JSON)
- LLM response (answer): ~134 tokens
Total: ~311 tokens
Phase 1 Total: ~590 tokens
Tool definition structure (why 140 tokens):
```json
{
  "type": "function",
  "function": {
    "name": "get_device_metrics",
    "description": "Get detailed metrics and information for a specific network device by its device ID. Returns device name, type (router/switch/modem/core), location (city and area), operational status (operational/degraded/down), alias, and timestamp information.",
    "parameters": {
      "type": "object",
      "properties": {
        "device_id": {
          "type": "string",
          "description": "The unique device identifier (e.g., 'DEV_ID_123')"
        }
      },
      "required": ["device_id"]
    }
  }
}
```
Every word in that description and every parameter definition costs tokens. And this definition gets sent with EVERY query.
Baseline established: 590 tokens per query.
Phase 2: Tool Definition Scaling (1 Tool → 6 Tools)
What changed: Added 5 more tools:
get_device_metrics_timeseries - Historical CPU/memory/bandwidth data
get_devices_by_metric_threshold - Filter devices by performance metrics
get_device_uptime_history - Uptime/downtime events
get_device_neighbors - Network topology connections
get_devices_in_path - Path between two devices
Query: Same as Phase 1—"Get me metrics for device DEV_ID_123"
Key insight: The LLM still picks the correct tool (get_device_metrics). But now it has 6 tool definitions to process instead of 1.

Token breakdown:
Call 1 (LLM decision):
- System prompt: ~100 tokens
- Tool definitions (6 tools): ~840 tokens ← 6x increase
- User query: ~20 tokens
- LLM response (tool call): ~19 tokens
Total: ~979 tokens
Call 2 (LLM synthesis):
- Previous messages (no tools): ~225 tokens
- Tool result: ~200 tokens
- LLM response: ~176 tokens
Total: ~601 tokens
Phase 2 Total: ~1,204 tokens
Result: 2.04x increase (590 → 1,204 tokens)
The math:
1 tool = 140 tokens
6 tools = 840 tokens (+700 tokens, or +119%)
Linear scaling: 10 tools = 1,400 tokens, 100 tools = 14,000 tokens
At scale: If you're building an enterprise agent with 70-100 tools across domains (network, database, application, infrastructure), you're paying 14,000 tokens per query just for tool definitions.
Cost projection (100 tools, 1,000 queries/day):
14K tokens × 1,000 queries = 14M tokens/day
14M × 365 = 5.1B tokens/year
At $0.150 per 1M input tokens (gpt-4o-mini): $765/year just for tool definitions
And we haven't even executed a single tool yet.
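The projection above is simple arithmetic, and it is worth wiring into a helper so you can test scenarios before committing to a tool catalog. This sketch uses the article's figures (~140 tokens per definition, gpt-4o-mini input at $0.150 per 1M tokens); both are assumptions you should replace with your own measurements.

```python
# Article's figures; substitute your own measured values.
TOKENS_PER_TOOL = 140
INPUT_PRICE_PER_M = 0.150  # gpt-4o-mini, $ per 1M input tokens

def tool_definition_cost(num_tools, queries_per_day, days=365):
    """Annual token and dollar overhead of resending tool definitions."""
    tokens_per_query = num_tools * TOKENS_PER_TOOL
    total_tokens = tokens_per_query * queries_per_day * days
    dollars = total_tokens / 1e6 * INPUT_PRICE_PER_M
    return tokens_per_query, total_tokens, dollars

# 100 tools, 1,000 queries/day: 14,000 tokens/query, ~5.1B tokens/year.
per_query, total, dollars = tool_definition_cost(100, 1000)
```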
Phase 3: Conversation Depth (Multi-Tool Workflows)
Scenario: User asks: "Find devices with CPU above 70%, show their neighbors, and check paths from DEV_ID_123 to each high-CPU device"
This requires 3 sequential tool calls:
get_devices_by_metric_threshold - Find high-CPU devices
get_device_neighbors - Get neighbors for each device
get_devices_in_path - Check paths
The problem: Each iteration carries the full conversation history forward.

Iteration breakdown:
Iteration 1:
Messages sent to LLM:
[
{role: "system", content: "..."},
{role: "user", content: "Find devices with CPU > 70%..."}
]
+ 6 tool definitions
Tokens: ~900
LLM decides to call get_devices_by_metric_threshold.
Iteration 2:
Messages sent to LLM:
[
{role: "system", content: "..."},
{role: "user", content: "Find devices with CPU > 70%..."},
{role: "assistant", tool_calls: [...]}, ← LLM's decision
{role: "tool", content: "{...filtered devices...}"} ← Tool result (~200 tokens)
]
+ 6 tool definitions
Tokens: ~1,100
LLM decides to call get_device_neighbors.
Iteration 3:
Messages sent to LLM:
[
{role: "system", content: "..."},
{role: "user", content: "Find devices with CPU > 70%..."},
{role: "assistant", tool_calls: [...]},
{role: "tool", content: "{...filtered devices...}"},
{role: "assistant", tool_calls: [...]}, ← Previous iteration
{role: "tool", content: "{...neighbors data...}"} ← ~300 tokens
]
+ 6 tool definitions
Tokens: ~1,500
LLM decides to call get_devices_in_path.
Final synthesis call:
All previous messages + final tool result
Tokens: ~1,800
Phase 3 average: ~2,910 tokens (across multiple queries, averaging 2.2 iterations)
Result: 2.42x increase from Phase 2
Why this happens:
LLMs are stateless. They don't "remember" previous calls. The ONLY way they know what happened before is if you send the entire conversation history.
Each iteration isn't just "new query + new tool result." It's:
All previous user messages
All previous LLM decisions (tool calls)
All previous tool results
Plus the new stuff
The amplifier effect:
Some tools return large responses. Our get_device_metrics_timeseries returns 24 hours of CPU/memory/bandwidth data—about 400 tokens of JSON.
When that gets included in iteration 2, 3, 4... it's not just 400 tokens once. It's 400 tokens replayed in every subsequent LLM call.
Conversation structure after 3 iterations:
[
{"role": "system", "content": "..."}, # 100 tokens
# Iteration 1
{"role": "user", "content": "..."}, # 50 tokens
{"role": "assistant", "tool_calls": [...]}, # 30 tokens
{"role": "tool", "content": "{...}"}, # 200 tokens
# Iteration 2
{"role": "assistant", "tool_calls": [...]}, # 30 tokens
{"role": "tool", "content": "{...}"}, # 300 tokens
# Iteration 3
{"role": "assistant", "tool_calls": [...]}, # 30 tokens
{"role": "tool", "content": "{...}"}, # 250 tokens
# Final synthesis
{"role": "assistant", "content": "Based on the data..."} # 150 tokens
]
Total history: ~1,140 tokens (before tool definitions)
+ 6 tool definitions: ~840 tokens
= ~1,980 tokens just to maintain context
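The replay effect is easy to demonstrate with arithmetic. Using per-message token counts close to the breakdown above (illustrative, not exact), summing what is actually sent on each iteration shows how every tool result is billed again on every subsequent call:

```python
# Approximate per-message token counts, in the spirit of the breakdown above.
SYSTEM, TOOL_DEFS = 100, 840
history = [("user", 50)]
tool_results = [200, 300, 250]  # one tool result appended per iteration

sent_per_iteration = []
for result_tokens in tool_results:
    # Each request resends the system prompt, tool definitions, and ALL history.
    sent = SYSTEM + TOOL_DEFS + sum(t for _, t in history)
    sent_per_iteration.append(sent)
    history.append(("assistant", 30))        # tool-call decision
    history.append(("tool", result_tokens))  # tool result, replayed forever after

total_sent = sum(sent_per_iteration)
```

The per-iteration totals land near the ~900 / ~1,100 / ~1,500 figures measured above, and the sum across three iterations is already several times the cost of any single call.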
Phase 4: Multi-Turn Conversations (The Real Killer)
Scenario: Three-turn conversation with context references:
Turn 1: "Show me metrics for DEV_ID_123"
Turn 2: "What about its neighbors?" ← refers to DEV_ID_123
Turn 3: "Check uptime for those neighbors" ← refers to neighbors from Turn 2
The challenge: Turn 3 needs the full conversation history to understand "those neighbors."

Turn-by-turn breakdown:
Turn 1:
Messages:
[
{role: "system", content: "..."},
{role: "user", content: "Show me metrics for DEV_ID_123"},
{role: "assistant", tool_calls: [...]},
{role: "tool", content: "{...device data...}"},
{role: "assistant", content: "Device DEV_ID_123 is operational..."}
]
Tokens: ~1,591
Turn 2:
Messages:
[
{role: "system", content: "..."},
# Turn 1 history (all of it)
{role: "user", content: "Show me metrics for DEV_ID_123"},
{role: "assistant", tool_calls: [...]},
{role: "tool", content: "{...device data...}"},
{role: "assistant", content: "Device DEV_ID_123 is operational..."},
# Turn 2 (new)
{role: "user", content: "What about its neighbors?"},
{role: "assistant", tool_calls: [...]},
{role: "tool", content: "{...neighbors data...}"},
{role: "assistant", content: "DEV_ID_123 has 3 neighbors..."}
]
Tokens: ~2,379 (+50% from Turn 1)
Turn 3:
Messages:
[
{role: "system", content: "..."},
# Turn 1 history
{role: "user", content: "Show me metrics for DEV_ID_123"},
{role: "assistant", tool_calls: [...]},
{role: "tool", content: "{...device data...}"},
{role: "assistant", content: "Device DEV_ID_123 is operational..."},
# Turn 2 history
{role: "user", content: "What about its neighbors?"},
{role: "assistant", tool_calls: [...]},
{role: "tool", content: "{...neighbors data...}"},
{role: "assistant", content: "DEV_ID_123 has 3 neighbors..."},
# Turn 3 (new)
{role: "user", content: "Check uptime for those neighbors"},
{role: "assistant", tool_calls: [...]},
{role: "tool", content: "{...uptime data...}"},
{role: "assistant", content: "All three neighbors have 99%+ uptime..."}
]
Tokens: ~4,118 (+73% from Turn 2)
Phase 4 average: ~7,166 tokens per 3-turn conversation
Result: 2.46x increase from Phase 3
Growth pattern:
Turn 1: 1,591 tokens (baseline)
Turn 2: 2,379 tokens (+50%)
Turn 3: 4,118 tokens (+73%)
This is exponential, not linear.
Context dependency matters:
We tested 4 conversation patterns:
1. Linked context (pronouns: "its", "those")
   - Average: 8,088 tokens
   - Cannot truncate history without breaking references
2. Independent questions (no context overlap)
   - Average: 6,247 tokens
   - 80% of history is pure waste
3. Mixed pattern (partial dependencies)
   - Average: 7,164 tokens
   - Needs smart selective retention
4. Error recovery (corrections, retries)
   - Failed in testing (implementation gap)
The universal truth:
This isn't specific to my implementation. This is how ALL LLMs work:
ChatGPT
Claude
Gemini
Every LangChain/LlamaIndex app
LLMs are stateless. Conversation history is the ONLY way they "remember." Every production system sends the full conversation on every turn.
Why tool_calls AND tool_results must be sent:
You might think: "Can't we just send the assistant's final answers and skip the tool internals?"
No. The OpenAI API requires this structure:
[
{"role": "assistant", "tool_calls": [{"id": "call_abc123", ...}]},
{"role": "tool", "tool_call_id": "call_abc123", "content": "{...}"}
]
The tool_call_id must match. The LLM needs to see:
What tool it decided to call (reasoning chain)
What data came back (to reference in synthesis)
The full context (to make follow-up decisions)
You can't skip the tool internals without breaking the API contract.
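A quick structural check makes that contract concrete. If a compression scheme orphans a tool message (or strips the result that answers a tool_call), the request is rejected, so it pays to validate pairing before sending. A minimal checker, sketched here against plain message dicts:

```python
def validate_tool_pairing(messages):
    """Every assistant tool_call id must be answered by a tool message,
    and every tool message must reference an earlier tool_call."""
    pending = set()
    for msg in messages:
        if msg.get("tool_calls"):
            pending |= {tc["id"] for tc in msg["tool_calls"]}
        elif msg.get("role") == "tool":
            if msg["tool_call_id"] not in pending:
                return False  # orphaned tool result
            pending.discard(msg["tool_call_id"])
    return not pending  # unanswered tool calls also break the contract

history = [
    {"role": "assistant", "tool_calls": [{"id": "call_abc123"}]},
    {"role": "tool", "tool_call_id": "call_abc123", "content": "{}"},
]
validate_tool_pairing(history)       # True: well-formed pair
validate_tool_pairing(history[1:])   # False: tool result with no matching call
```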
Each turn in history includes:
User message (~20 tokens)
Assistant tool_call decision (~30 tokens)
Tool result (~200-400 tokens, depending on response size)
Assistant synthesis (~150 tokens)
Multiply by number of turns. That's your history cost.
The Complete Picture: From 590 to 7,166 Tokens

| Phase | Scenario | Tokens | Multiplier | Cost/Year* |
|---|---|---|---|---|
| Phase 1 | Single tool, single query | 590 | 1.0x | $32 |
| Phase 2 | 6 tools, single query | 1,204 | 2.0x | $66 |
| Phase 3 | 6 tools, multi-tool workflow | 2,910 | 4.9x | $159 |
| Phase 4 | 6 tools, 3-turn conversation | 7,166 | 12.1x | $392 |
*Assumes 1,000 queries/day, 365 days, gpt-4o-mini pricing
The exponential pattern:
Adding 5 tools: 2x cost
Adding 2 workflow iterations: 2.4x cost
Adding 2 conversation turns: 2.5x cost
Compound effect: 12.1x from baseline
Conversation depth costs more than tool quantity.
This isn't obvious until you measure it.
The Scaling Nightmare
Extrapolate to production scale:
Enterprise monitoring agent:
100 tools (network, database, application, infrastructure)
5-turn conversations (realistic troubleshooting session)
50 queries/user/day
100 power users
Token projection:
Tool definitions: 14,000 tokens
Conversation depth: 10,000 tokens (5 iterations avg)
History accumulation: 20,000+ tokens (5 turns)
Total per conversation: ~44,000 tokens
Daily usage: 100 users × 50 queries = 5,000 queries
Daily tokens: 5,000 × 44,000 = 220M tokens
Annual tokens: 220M × 365 = 80.3B tokens
Cost (gpt-4o-mini):
- Input: 80.3B × $0.150/1M = $12,045/year
- Output: 20B × $0.600/1M = $12,000/year
Total: $24,045/year minimum
Cost (gpt-4):
- Input: 80.3B × $2.50/1M = $200,750/year
- Output: 20B × $10/1M = $200,000/year
Total: $400,750/year
And this is JUST token costs. Not infrastructure, engineering, support, or training data.
At 1,000 users: $240K/year (gpt-4o-mini) or $4M/year (gpt-4).
Token management isn't a nice-to-have. It's a fundamental cost driver.
What Production Systems Do (And Their Trade-offs)
Every AI company faces this. Here's what they do:
1. Summarization (OpenAI, Anthropic)
Strategy: After N turns, replace old messages with a summary.
Example:
Turn 1-5: [full messages] - 10,000 tokens
Becomes: [summary] - 500 tokens
Trade-offs:
✅ Massive token savings (20x compression)
❌ Loses detail (can't reference specific data points)
❌ Summarization can hallucinate or miss nuance
❌ Adds latency (extra LLM call for summarization)
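A sketch of the pattern, with the summarization call stubbed out (a real implementation would make an extra LLM request here, which is exactly where the added latency comes from; `keep_last` and the summary format are arbitrary choices for illustration):

```python
def summarize(messages):
    # Stub: a real system would ask an LLM to compress these messages.
    return f"[Summary of {len(messages)} earlier messages]"

def compact_history(messages, keep_last=4):
    """Replace everything except the system prompt and the last few
    messages with a single summary message."""
    system, rest = messages[:1], messages[1:]
    if len(rest) <= keep_last:
        return messages  # nothing worth compacting yet
    old, recent = rest[:-keep_last], rest[-keep_last:]
    return system + [{"role": "system", "content": summarize(old)}] + recent
```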
2. Sliding Window (Common Pattern)
Strategy: Keep only last N turns, drop the rest.
Example:
Conversation with 10 turns
Keep: Turn 8, 9, 10
Drop: Turn 1-7
Trade-offs:
✅ Simple to implement
✅ Predictable token usage
❌ Can't reference old context ("Remember that device from Turn 3?")
❌ Breaks long troubleshooting sessions
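A minimal version keeps the system prompt plus the most recent messages. One subtlety worth showing: the window must not open on an orphaned tool result, or the request violates the tool_call pairing contract (a production version would also drop a trailing assistant tool_call whose answer fell outside the window):

```python
def sliding_window(messages, max_messages=6):
    """Keep the system prompt and the most recent messages, drop the rest."""
    system, rest = messages[:1], messages[1:]
    kept = rest[-max_messages:]
    # A tool message at the window edge has lost the assistant message
    # that issued its tool_call, so drop it too.
    while kept and kept[0].get("role") == "tool":
        kept = kept[1:]
    return system + kept
```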
3. Semantic Compression (Advanced)
Strategy: Analyze conversation, identify essential messages, drop irrelevant ones.
Example:
Turn 1: "Show device metrics" → Keep (context for Turn 2)
Turn 2: "What about neighbors?" → Keep (context for Turn 3)
Turn 3: "Show uptime" → Keep (most recent)
Turn 4: Independent query → Drop (not referenced later)
Trade-offs:
✅ Optimal token usage (keep only what's needed)
✅ Maintains coherence for linked context
❌ Complex logic (requires NLP analysis)
❌ Can make mistakes (drop something that's referenced later)
❌ Engineering overhead
4. RAG for Long Conversations (Enterprise)
Strategy: Store conversation in vector database, retrieve relevant snippets on demand.
Example:
Full conversation: 50 turns in vector DB
Current query: "What was that error from earlier?"
Retrieve: Turn 12, 13, 14 (error context)
Send to LLM: Only retrieved turns + current query
Trade-offs:
✅ Scales to very long conversations
✅ Semantic retrieval (finds relevant context)
❌ High engineering complexity
❌ Retrieval can miss context
❌ Adds latency (DB query + embedding)
5. Truncate Tool Results (Our Insight)
Strategy: Keep assistant responses (natural language), drop or compress tool_calls and tool_results.
Example:
Instead of:
{role: "tool", content: "{cpu: 78%, memory: 85%, bandwidth: 920mbps, ...400 tokens}"}
Send:
{role: "tool", content: "Summary: High CPU (78%), memory normal"}
Trade-offs:
✅ 3-5x reduction in history size
✅ Maintains conversational coherence (assistant answers kept)
❌ LLM can't reference raw data ("What was the exact CPU value?")
❌ Requires smart summarization logic
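A sketch of the idea: rewrite oversized tool payloads in past turns into short digests while leaving assistant answers untouched. The digest logic here (keep the first few JSON fields) is a deliberately naive placeholder; the article's point is that real summarization logic is the hard part.

```python
import json

def compress_tool_results(messages, max_chars=80):
    """Replace large tool payloads with short digests, keeping the
    assistant's natural-language answers intact."""
    compressed = []
    for msg in messages:
        if msg.get("role") == "tool" and len(msg.get("content", "")) > max_chars:
            data = json.loads(msg["content"])
            # Naive digest: first three fields. Real systems need smarter logic.
            digest = "Summary: " + ", ".join(
                f"{k}={v}" for k, v in list(data.items())[:3])
            msg = {**msg, "content": digest}
        compressed.append(msg)
    return compressed
```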
None of these are perfect. Everyone struggles with this.
The industry is actively researching better solutions. But for now, this is the reality.
What We're Testing Next
Phase 3: Execution Optimizations (Tactical)
Parallel tool execution
Execute independent tools concurrently
Reduces iterations (3 sequential calls → 1 parallel batch)
Target: 30-40% token reduction
Smart history truncation
Keep assistant responses, drop tool internals
Context-aware (keep turns with pronoun references)
Target: 3-5x reduction in history size
Tool result summarization
Compress large JSON responses (timeseries → summary stats)
Keep raw data in external store, reference by ID
Target: 2-3x reduction per large tool response
Phase 4: Tool Selection Optimization (Strategic)
The 10x win. This is where it gets interesting.
The problem: 100 tools × 140 tokens = 14,000 tokens per query.
The solution: Don't send all 100 tools. Send the top 5-10 most relevant.
Approaches we'll test:
Semantic routing (vector embeddings)
Embed tool descriptions in vector space
Embed user query
Retrieve top-K most similar tools
Send only those to LLM
Target: 14,000 → 1,400 tokens (10x)
Hierarchical tool organization
Category tools: "network", "database", "application"
LLM first picks category (1 LLM call)
Then picks specific tool from category (2nd LLM call)
Target: 14,000 → 2,000 tokens (7x)
Two-stage LLM (routing + execution)
Stage 1: Lightweight routing model picks tools (cheap)
Stage 2: Main model executes with only selected tools
Target: 14,000 → 1,500 tokens (9x)
Hypothesis: Tool selection optimization is more valuable than conversation compression.
We'll measure and share results.
Key Takeaways
For builders:
Measure before optimizing. You can't improve what you don't understand. Build visibility into your system from day 1.
Token costs are architectural, not incidental. Like database indexing or cache strategy, token management is a fundamental design concern.
Frameworks are great, but understand what they hide. LangChain and LlamaIndex solve real problems. But they abstract away cost mechanics. Know when to use them and when to build custom.
Conversation depth costs more than tool quantity. Adding 5 tools doubled costs. Adding 2 conversation turns tripled them. Multi-turn conversations are exponentially expensive.
For architects:
Budget for 3-5x token growth in production vs prototype. Your PoC that costs $50/month will cost $500-1,000/month at scale. Plan accordingly.
Context window limits are real. gpt-4o-mini has a 128K token context window. At our Phase 4 rate (2,696 tokens/turn), that's ~47 turns before you hit the limit. Then you MUST truncate or summarize.
LLMs are stateless everywhere. ChatGPT, Claude, Gemini—everyone faces this. Conversation history is the only way to maintain context. Design your system with this constraint in mind.
Tool selection > conversation compression (hypothesis to test). At 100 tools, reducing tool definitions from 14K → 1.4K saves more than aggressive history truncation.
For consultants:
This is a differentiator. Most teams don't measure token usage this deeply. They prototype, scale, and then panic when costs explode. Understanding token economics gives you a 5-10x cost advantage.
Cost optimization is strategic, not tactical. Picking gpt-4o-mini over gpt-4 is tactical (3x savings). Semantic tool routing is strategic (10x savings). Both matter, but strategic wins compound.
Token mechanics = AI economics. If you're advising clients on AI adoption, you need to understand this. Token costs are to AI what compute costs are to cloud infrastructure.
Conclusion
I started this investigation because I kept hearing: "LLM costs are manageable if you optimize prompts and pick the right model."
That's true for simple use cases. But for production AI agents with:
Dozens of tools
Multi-step workflows
Multi-turn conversations
Power users running hundreds of queries per day
...prompt optimization is noise. The signal is architectural.
Token costs don't scale linearly. They compound:
Tool definitions (linear)
Conversation depth (exponential)
History accumulation (exponential)
At enterprise scale, this becomes a $100K-$1M/year line item. That's not a rounding error. That's a strategic decision.
The good news: It's solvable. Semantic routing, smart truncation, parallel execution—these aren't exotic techniques. They're engineering problems with known solutions.
But you can't solve what you don't measure.
Build visibility. Measure religiously. Optimize strategically.
That's the difference between an AI prototype and an AI product.
About the author: I'm an independent technical consultant with 15 years of experience building production systems. Currently conducting systematic research into LLM optimization and token economics. Follow along as I share results from other phases of my token research.
Want to discuss token optimization strategies for your AI system? Drop a comment or reach out. I'm always interested in comparing notes with other builders tackling this problem.






