Model Selection for AI Agents: Measuring Token Costs Across OpenAI's Model Family

I've been building an AI agent from scratch, with no frameworks and no abstractions, to understand where every token goes and what drives cost at scale. In the previous post, I measured how tool definitions and conversation depth impact token usage. The findings were clear: tool definitions scale linearly (~140 tokens per tool), but costs compound with conversation depth. A three-turn conversation can triple your costs.
This post isolates a different variable: model choice.
Most teams pick models based on vibes or what's currently hyped. "gpt-5 is the newest, so we'll use that." But for AI agents, systems that orchestrate tools rather than solve complex reasoning problems, raw intelligence isn't the bottleneck. Token efficiency is.
I tested six OpenAI models on identical agent workflows:
gpt-4o-mini
gpt-4.1
gpt-5.1
gpt-5-mini
gpt-5
gpt-5-nano
Same queries, same tools, same system prompt. The only variable: the model.
This post breaks down the measurements, shows where the cost explosion happens, and provides a framework for choosing the right model for production agent systems.
Why Model Choice Matters for Agents
AI agents are fundamentally different from reasoning systems or chatbots. An agent's workflow is:
Parse user intent
Select appropriate tool(s)
Execute tool calls
Synthesize results into natural language
This is tool orchestration, not deep reasoning. The model needs to be reliable, fast, and cheap—not necessarily the smartest in the room.
Yet most organizations default to the latest, most powerful model without measuring whether that power translates to value. The assumption: "Better model = better agent." But for agents, "better" often means "more expensive with no meaningful improvement in output quality."
Token efficiency matters because agent costs compound. A chatbot might handle 100-500 queries per day. An agent in production can hit 10,000+ queries per day easily via monitoring alerts, processing support tickets, analyzing logs, handling API requests. Every extra token multiplies across tens of thousands of daily executions.
Industry pattern: Teams prototype with gpt-4o or gpt-5 because "it works," then ship to production without revisiting model choice. Six months later, the invoice is eye-watering and no one knows why.
Model selection isn't a one-time decision made during prototyping. It's an architecture decision with direct P&L impact.
The Test
I ran identical workflows across six OpenAI models to measure token usage, cost, and latency under controlled conditions.
Test Setup:
Workflow: Single-tool queries (device metrics lookup via function calling)
Query count: 6 identical queries per model
Models tested: gpt-4o-mini, gpt-4.1, gpt-5.1, gpt-5-mini, gpt-5, gpt-5-nano
System prompt: ~200 tokens (identical across all tests)
Tool definitions: 10 tools (~1,360 tokens total)
No optimizations: No caching, no history truncation, no parallel execution
What I measured:
Input tokens (system prompt + tool definitions + user query)
Output tokens (tool call + natural language response)
Total cost per query (using OpenAI's December 2025 pricing)
API response latency (time from request to completion)
Why single-tool baseline matters: Multi-tool workflows and conversation depth add variables that obscure model-specific behavior. A single-tool query isolates how each model handles basic agent orchestration: parse intent → select tool → execute → synthesize response. This is the atomic unit of agent work.
The workflow is simple by design. If model choice creates 10-15x cost differences on simple queries, imagine the impact on complex multi-turn conversations with 5-10 tool calls.

Agent Architecture (Simplified):
```python
import json
import openai  # assumes SYSTEM_PROMPT, TOOL_DEFINITIONS, execute_tool are defined elsewhere

def handle_query(user_query, model):
    # Call 1: Intent + Tool Selection
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_query},
    ]
    response_1 = openai.chat.completions.create(
        model=model,
        messages=messages,
        tools=TOOL_DEFINITIONS,  # ~1,360 tokens
    )

    # Extract tool call
    tool_call = response_1.choices[0].message.tool_calls[0]

    # Execute tool
    tool_result = execute_tool(tool_call.function.name,
                               tool_call.function.arguments)

    # Call 2: Synthesize Response
    messages.append(response_1.choices[0].message)
    messages.append({
        "role": "tool",
        "tool_call_id": tool_call.id,
        "content": json.dumps(tool_result),
    })
    response_2 = openai.chat.completions.create(
        model=model,
        messages=messages,
    )
    return response_2.choices[0].message.content
```
Every query requires two LLM calls. The first call decides which tool to use. The second call synthesizes the tool result into a natural language response. Token costs accumulate across both calls.
Results & Analysis
The Numbers
| Model | Avg Total Tokens | Avg Input | Avg Output | Cost/Query | Relative Cost | Speed |
|---|---|---|---|---|---|---|
| gpt-4o-mini | 1,345 | 1,167 | 178 | $0.000751 | 1.00x | 3.1s |
| gpt-4.1 | 1,504 | 1,312 | 192 | $0.004160 | 5.54x | 4.7s |
| gpt-5.1 | 1,476 | 1,319 | 157 | $0.003219 | 4.29x | 6.1s |
| gpt-5-mini | 2,233 | 1,397 | 836 | $0.002021 | 2.69x | 14.6s |
| gpt-5 | 2,313 | 1,351 | 962 | $0.011313 | 15.06x | 16.0s |
| gpt-5-nano | 2,747 | 1,390 | 1,357 | $0.000612 | 0.81x | 13.5s |
Pricing (per 1M tokens, Dec 2025):
gpt-4o-mini: $0.40 input / $1.60 output
gpt-4.1: $2.00 input / $8.00 output
gpt-5.1: $1.25 input / $10.00 output
gpt-5-mini: $0.25 input / $2.00 output
gpt-5: $1.25 input / $10.00 output
gpt-5-nano: $0.05 input / $0.40 output
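The per-query figures in the table fall out of simple arithmetic on these rates. A minimal sketch (the `PRICING` dict and `cost_per_query` helper are mine, not part of the test harness):

```python
# Per-1M-token pricing from the list above (input, output), Dec 2025.
PRICING = {
    "gpt-4o-mini": (0.40, 1.60),
    "gpt-4.1":     (2.00, 8.00),
    "gpt-5.1":     (1.25, 10.00),
    "gpt-5-mini":  (0.25, 2.00),
    "gpt-5":       (1.25, 10.00),
    "gpt-5-nano":  (0.05, 0.40),
}

def cost_per_query(model, input_tokens, output_tokens):
    """Dollar cost of one query: token counts times per-1M rates."""
    in_rate, out_rate = PRICING[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# Reproduces the table from the average token counts.
print(round(cost_per_query("gpt-5", 1351, 962), 6))        # ~0.011309
print(round(cost_per_query("gpt-4o-mini", 1167, 178), 6))  # ~0.000752
```

Small differences from the table (e.g. $0.011309 vs $0.011313) come from rounding the average token counts; the structure of the cost is the same.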
Initial observations:
The gpt-5 family (gpt-5, gpt-5-mini, gpt-5-nano) produces 4-7x more output tokens than gpt-4 models for identical queries. gpt-4o-mini averages 178 output tokens. gpt-5-nano averages 1,357 output tokens—7.6x more verbose.
Input token counts are relatively stable (1,167-1,397 range). The explosion happens in output tokens—the natural language responses synthesized after tool execution.
Speed varies dramatically. gpt-4o-mini completes queries in 3.1 seconds. gpt-5 takes 16 seconds—5x slower for the same workflow.
Cost per query ranges from $0.000612 (gpt-5-nano) to $0.011313 (gpt-5)—an 18x difference.
Output Token Explosion
The gpt-5 family's verbosity isn't a bug. It's a feature optimized for different use cases.
Example: "What are the current metrics for device DEV_ID_123?"
gpt-4o-mini response (178 tokens):
Device DEV_ID_123 (Mumbai Router) is operational. Current metrics:
- CPU: 45%
- Memory: 62%
- Bandwidth: 450 Mbps
- Packet Loss: 0.02%
- Last ping: 2 minutes ago
gpt-5-nano response (1,357 tokens):
I've retrieved the current metrics for device DEV_ID_123, which is
identified as the Mumbai Router in your network infrastructure. Let me
break down the detailed performance metrics for you:
**Current Operational Status:**
The device is currently in operational state, meaning all systems are
functioning within normal parameters and the device is actively handling
network traffic as expected.
**CPU Utilization:**
The processor is currently running at 45% capacity. This is well within
the healthy operating range for network devices of this class. CPU usage
below 70% typically indicates the device has sufficient headroom to handle
traffic spikes without performance degradation.
**Memory Usage:**
RAM utilization stands at 62% of total available memory. This level is
normal for a router handling typical enterprise workloads. Memory usage
becomes concerning only when it consistently exceeds 85%, at which point
you might see packet drops or routing table processing delays.
**Bandwidth Consumption:**
Current bandwidth utilization is 450 Mbps. To properly contextualize this
number, I'd need to know the total interface capacity, but this represents
the current throughput being processed by the device.
**Packet Loss Rate:**
The device is reporting 0.02% packet loss, which is excellent. Industry
standards consider anything below 1% to be acceptable for most applications,
and below 0.1% to be ideal. Your current rate indicates a very healthy
network connection with minimal data loss.
**Connectivity Status:**
The last successful ping response was received 2 minutes ago, confirming
the device is actively responding to network health checks and is reachable
via management protocols.
**Overall Assessment:**
Based on these metrics, DEV_ID_123 is performing optimally with no
immediate concerns requiring attention. All key performance indicators
are within expected ranges for normal operations.
Both responses contain the same factual data: CPU 45%, Memory 62%, Bandwidth 450 Mbps, Packet Loss 0.02%, Last ping 2 minutes ago.
gpt-4o-mini delivers facts. gpt-5-nano delivers analysis, context, and an overall assessment. For a chatbot or reasoning system, that extra detail might add value. For an agent synthesizing tool output, it's pure cost overhead.
Why this happens:
The gpt-5 family is trained to provide thorough, well-reasoned responses. When you ask gpt-5 to explain a complex topic or solve a multi-step problem, that verbosity is valuable. When you ask it to format JSON data into a sentence, that same training produces unnecessary elaboration.
gpt-4o-mini is optimized for efficiency. It generates concise responses by default. For agent workflows—where the model's job is to translate structured data into natural language—concise is exactly what you want.
Token cost impact:
At $10/1M output tokens (gpt-5 pricing), producing 1,357 tokens instead of 178 tokens costs an extra $0.01179 per query. That sounds trivial until you multiply it by 10,000 queries per day: $117.90/day = $43,034/year in unnecessary output tokens.

Cost Breakdown by Model
gpt-4o-mini: The baseline ($2,741/year)
Cost per query: $0.000751. At 10,000 queries/day: $2,741/year.
This is the efficiency leader. Low input costs ($0.40/1M), reasonable output costs ($1.60/1M), and lean responses (178 tokens avg). Speed is excellent at 3.1s per query—fast enough for real-time agent interactions.
For most agent workflows, gpt-4o-mini hits the sweet spot: cheap, fast, reliable.
gpt-5-nano: The paradox ($2,234/year)
Cost per query: $0.000612. At 10,000 queries/day: $2,234/year.
This is technically the cheapest model per query. Input tokens cost $0.05/1M (8x cheaper than gpt-4o-mini), and output tokens cost $0.40/1M (4x cheaper).
But here's the paradox: it produces the most verbose responses (1,357 tokens avg) and has the slowest latency (13.5s). You save $507/year compared to gpt-4o-mini, but you quadruple response time and generate 7.6x more output tokens.
The cost savings come from pricing, not efficiency. If your agent handles batch workloads where latency doesn't matter—overnight report generation, bulk data processing—gpt-5-nano might work. For real-time interactions, the 13.5s wait kills UX.
The verbosity is manageable with strict system prompts. You can tell the model "Respond in 2-3 sentences maximum" and it will comply most of the time. But you're fighting the model's training rather than working with it.
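A verbosity constraint like that lives in the system prompt. A hypothetical example (the wording below is my illustration, not the prompt used in these tests):

```python
# Hypothetical verbosity-limiting system prompt. gpt-5-family models
# comply most of the time, but you are fighting their default training.
CONSTRAINED_SYSTEM_PROMPT = (
    "You are a network operations agent. "
    "After a tool call, report only the returned values. "
    "Respond in 2-3 sentences maximum. "
    "Do not add context, caveats, or overall assessments."
)

messages = [{"role": "system", "content": CONSTRAINED_SYSTEM_PROMPT}]
```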
When gpt-5-nano makes sense: Batch processing with no latency requirements and extremely strict output constraints. Otherwise, the $507/year savings isn't worth the operational complexity.
gpt-5-mini: The caching candidate ($7,377/year)
Cost per query: $0.002021. At 10,000 queries/day: $7,377/year.
At 2.69x the cost of gpt-4o-mini, this seems like a poor choice. But there's a hidden advantage: prompt caching economics.
OpenAI offers 90% cache discounts on input tokens. For gpt-5-mini, that means cached input tokens cost $0.025/1M instead of $0.25/1M. For gpt-4o-mini, cached tokens cost $0.10/1M instead of $0.40/1M.
If you're caching system prompts and tool definitions (which represent 80-90% of input tokens in agent workflows), gpt-5-mini's cache discount is better than gpt-4o-mini's in absolute terms.
At 90% cache hit rate:
gpt-5-mini cached cost: ~$0.001738/query
gpt-4o-mini cached cost: ~$0.000539/query
gpt-5-mini is still roughly 3.2x more expensive even with caching, but the gap narrows significantly. I'm testing this in the next phase to see if cache hit rates are stable enough to justify the higher base cost.
For now, gpt-5-mini is a "maybe" for high-cache-hit-rate workloads. Default to gpt-4o-mini unless you've measured >90% cache hits and confirmed the cost math works.
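The cached math above can be expressed as a small helper. This is a sketch under one assumption: the discount applies only to the cached share of input tokens, and output tokens are never cached.

```python
def cached_cost_per_query(input_tokens, output_tokens, hit_rate,
                          in_rate, cached_in_rate, out_rate):
    """Per-query cost when a fraction of input tokens hits the cache.

    Rates are dollars per 1M tokens. Assumes the cache discount applies
    only to the cached share of input; output tokens are never cached.
    """
    input_cost = input_tokens * (hit_rate * cached_in_rate
                                 + (1 - hit_rate) * in_rate)
    return (input_cost + output_tokens * out_rate) / 1_000_000

# gpt-5-mini: 1,397 input / 836 output tokens, 90% cache hit rate.
print(round(cached_cost_per_query(1397, 836, 0.90, 0.25, 0.025, 2.00), 6))  # ~0.001738
```

Note how little the input side matters once caching kicks in: almost all of the remaining cost is output tokens, which caching cannot touch.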
gpt-5.1: Less verbose, still expensive ($11,750/year)
Cost per query: $0.003219. At 10,000 queries/day: $11,750/year.
gpt-5.1 produces only 157 output tokens—less than gpt-4o-mini. This suggests OpenAI tuned it to be less chatty than gpt-5/gpt-5-mini.
But the pricing structure destroys the efficiency gains. Output tokens cost $10/1M, and even 157 tokens at that rate adds up. Input tokens cost $1.25/1M (3.1x more than gpt-4o-mini).
Latency is 6.1s—2x slower than gpt-4o-mini but faster than other gpt-5 models.
When gpt-5.1 makes sense: It doesn't, for agents. The 4.29x cost premium buys you nothing meaningful in agent workflows. If you need gpt-5-level reasoning, use gpt-5. If you need efficiency, use gpt-4o-mini. gpt-5.1 is stuck in the middle with no clear advantage.
gpt-4.1: Output pricing kills it ($15,184/year)
Cost per query: $0.004160. At 10,000 queries/day: $15,184/year.
Output tokens cost $8/1M—5x more than gpt-4o-mini. Even with lean responses (192 tokens avg), the pricing structure makes this uneconomical for agent workflows.
Input tokens cost $2/1M (5x more than gpt-4o-mini), and latency is 4.7s (1.5x slower).
When gpt-4.1 makes sense: If you need slightly better reasoning than gpt-4o-mini for specific complex queries, gpt-4.1 might be viable for a small subset of your traffic. But for bulk agent orchestration, the 5.54x cost premium isn't justified.
gpt-5: The reasoning tax ($41,292/year)
Cost per query: $0.011313. At 10,000 queries/day: $41,292/year.
This is 15.06x more expensive than gpt-4o-mini. At scale, that's $38,551/year wasted on capabilities you're not using.
gpt-5 produces 962 output tokens per query—5.4x more than gpt-4o-mini. Output tokens cost $10/1M, which means $0.00962 of the $0.011313 cost is pure output verbosity.
Latency is 16 seconds—5x slower than gpt-4o-mini.
Where gpt-5 excels: Complex reasoning tasks. Multi-step problem solving. Code generation with architectural decisions. Deep analysis where you want the model to "think out loud" and show its work.
Where gpt-5 fails: Tool orchestration. Function calling. Simple data formatting. Any workflow where the model's job is "take this JSON and turn it into a sentence."
The reasoning tax is paying for a Ferrari to deliver pizza. gpt-5's extended thinking and thorough analysis are wasted on "call get_device_metrics() and format the response." You're paying 15x more for capabilities that don't improve output quality in agent contexts.
The Latency Factor
Speed ranges from 3.1s (gpt-4o-mini) to 16s (gpt-5). For real-time agent interactions, this matters.
Real-time UX threshold: ~3-5 seconds
Users tolerate 3-5 second waits for "thinking" tasks. Beyond that, the experience feels sluggish. Chatbots can get away with 8-10 second responses if they're showing typing indicators. Agents—which users expect to be fast, efficient systems—can't.
If your agent is responding to Slack messages, API requests, or monitoring alerts, 16-second latency is unacceptable. Users will assume the system is broken.
When latency doesn't matter:
Batch workloads, overnight processing, background analysis. If you're generating daily reports at 3 AM, no one cares if it takes 3 seconds or 16 seconds per query.
But even in batch scenarios, slower models mean longer total processing time. If you're processing 10,000 queries overnight, gpt-4o-mini completes in 8.6 hours. gpt-5 takes 44.4 hours—almost two full days.
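The batch-time arithmetic is straightforward. A sketch assuming strictly sequential requests (the `batch_hours` helper is mine):

```python
def batch_hours(queries, seconds_per_query):
    """Sequential wall-clock time for a batch run, in hours."""
    return queries * seconds_per_query / 3600

print(round(batch_hours(10_000, 3.1), 1))   # gpt-4o-mini: ~8.6 hours
print(round(batch_hours(10_000, 16.0), 1))  # gpt-5: ~44.4 hours
```

In practice you would parallelize requests, which shrinks wall-clock time but not token cost.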
Speed correlates with cost: The fastest model (gpt-4o-mini, 3.1s) is also the cheapest ($0.000751). The slowest model (gpt-5, 16s) is also the most expensive ($0.011313).
This isn't coincidental. More powerful models do more computation per token, which increases both latency and cost.
Where the Money Goes
Input tokens are relatively stable across models (1,167-1,397 range). The cost explosion happens in output tokens.
Output pricing asymmetry:
OpenAI charges significantly more for output tokens than input tokens. For gpt-5, input costs $1.25/1M but output costs $10/1M—an 8x difference.
This makes sense from an infrastructure perspective. Generating tokens requires more computation than processing them. But it also means verbose models get punished hard.
Example cost breakdown (gpt-5 vs gpt-4o-mini):
gpt-5:
Input: 1,351 tokens × $1.25/1M = $0.00169
Output: 962 tokens × $10/1M = $0.00962
Total: $0.01131
gpt-4o-mini:
Input: 1,167 tokens × $0.40/1M = $0.00047
Output: 178 tokens × $1.60/1M = $0.00028
Total: $0.00075
gpt-5's output tokens alone ($0.00962) cost 12.8x more than the entire gpt-4o-mini query ($0.00075).
The verbosity problem isn't just "more tokens." It's "more tokens at 6.25x the unit price" ($10/1M vs $1.60/1M).
Incremental tokens cost far more on expensive models: doubling output tokens on gpt-4o-mini (178 → 356) adds $0.00028, while doubling output tokens on gpt-5 (962 → 1,924) adds $0.00962, a 34x larger increment, because gpt-5 combines 5.4x more tokens with a 6.25x higher unit price.
This is why model choice matters. Small differences in verbosity compound into massive cost differences at scale.
The Universal Pattern
This cost explosion pattern applies to all LLM systems, not just OpenAI. Anthropic, Google, DeepSeek, Grok—every provider charges more for output than input, and every model family has verbose variants optimized for reasoning rather than efficiency.
The fundamental trade-off:
You can optimize models for intelligence (reasoning, analysis, thoroughness) or efficiency (speed, cost, conciseness). You can't have both.
gpt-5 is optimized for intelligence. It's trained to provide detailed, well-reasoned responses. When you ask it to solve a complex problem, that training is valuable. When you ask it to format JSON into a sentence, that same training produces unnecessary elaboration.
gpt-4o-mini is optimized for efficiency. It generates concise responses by default. For agent workflows, where the model's job is translation rather than reasoning, concise is what you want.
The "reasoning tax":
Using gpt-5 for agent workflows is like hiring a neurosurgeon to take your temperature. The expertise is real, but it's overkill for the task. You're paying for capabilities you don't need.
At 10,000 queries/day, the reasoning tax costs $38,551/year. That's a mid-level engineer's salary wasted on output verbosity.
Why teams make this mistake:
Prototyping with the "best" model: During development, you test with gpt-5 because "we want the best results." The prototype works great. You ship to production without revisiting model choice.
No measurement culture: Most teams don't measure token costs per query. They see the monthly invoice and assume "LLMs are expensive." But a 10x cost difference between models is invisible without per-query metrics.
Confusing reasoning with reliability: Teams assume "smarter model = fewer errors." But for agent workflows, errors come from ambiguous tool definitions or poor error handling, not lack of model intelligence. gpt-4o-mini is just as reliable as gpt-5 for "select the right tool and format the response."
Sunk cost fallacy: Once you've built your agent on gpt-5, switching models feels risky. "What if gpt-4o-mini breaks our edge cases?" So teams stick with expensive models rather than testing cheaper alternatives.
Model selection as an architecture decision:
Model choice impacts:
Annual operating costs (5-15x difference)
Real-time latency (3-5x difference)
Throughput capacity (faster models = more queries/second)
Error recovery costs (verbose models generate more tokens during retries)
This isn't a detail you can ignore. It's a first-order concern that belongs in architecture reviews, not buried in implementation details.
When gpt-5 makes sense (not agents):
Use gpt-5 for:
Complex reasoning tasks (multi-step analysis, code generation with architectural decisions)
Exploratory work where you want the model to "think out loud"
High-value, low-volume queries where cost per query doesn't matter
Tasks where verbosity adds value (detailed explanations, teaching, tutoring)
Don't use gpt-5 for:
Tool orchestration (agent workflows)
Simple data formatting
High-volume, low-complexity queries
Real-time interactions where latency matters
Production Recommendations

Default Choice: gpt-4o-mini
For most agent workflows, gpt-4o-mini is the right choice.
Why it wins:
$2,741/year at 10,000 queries/day (baseline cost)
3.1s average latency (fast enough for real-time UX)
178 output tokens average (lean, no fluff)
Reliable tool selection and response formatting
Best cost/performance ratio across all tested models
When to stick with gpt-4o-mini:
Real-time agent interactions (Slack bots, API endpoints, monitoring alerts)
High-volume workflows (>1,000 queries/day)
Straightforward tool orchestration (select tool → execute → format response)
Budget-conscious deployments
Cost projections:
| Queries/Day | Annual Cost |
|---|---|
| 1,000 | $274 |
| 5,000 | $1,371 |
| 10,000 | $2,741 |
| 50,000 | $13,706 |
| 100,000 | $27,412 |
At 100K queries/day (enterprise scale), gpt-4o-mini costs $27,412/year. gpt-5 would cost $412,915/year, a $385,503 difference.
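The projection table is just cost-per-query times daily volume times 365. A helper for running the same projection on your own numbers (the `annual_cost` function is mine):

```python
def annual_cost(cost_per_query, queries_per_day):
    """Projected yearly spend at a constant daily query volume."""
    return cost_per_query * queries_per_day * 365

# gpt-4o-mini baseline at the volumes in the table above.
for qpd in (1_000, 5_000, 10_000, 50_000, 100_000):
    print(f"{qpd:>7,} queries/day -> ${annual_cost(0.000751, qpd):,.0f}/year")
```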
Exception Case: gpt-5-mini with Prompt Caching
If you're implementing prompt caching and achieving 90%+ cache hit rates, gpt-5-mini becomes interesting.
Why caching changes the math:
Cached input tokens for gpt-5-mini cost $0.025/1M (90% discount from $0.25/1M). Cached input tokens for gpt-4o-mini cost $0.10/1M (75% discount from $0.40/1M).
In absolute terms, gpt-5-mini's cached rate is 4x cheaper than gpt-4o-mini's cached rate.
For agent workflows where 80-90% of input tokens are cacheable (system prompt + tool definitions), this narrows the cost gap significantly.
At 90% cache hit rate:
gpt-5-mini: ~$0.001738/query → $6,344/year at 10K queries/day
gpt-4o-mini: ~$0.000539/query → $1,967/year at 10K queries/day
gpt-5-mini is still 3.2x more expensive, and you're still dealing with verbose outputs (836 tokens avg) and slower latency (14.6s).
When gpt-5-mini might work:
Proven 90%+ cache hit rates in production
Batch workloads where 14.6s latency is acceptable
Strict output constraints to manage verbosity
My take: Test it in Phase 3.2, but don't assume it's better. The cache math looks good on paper, but operational complexity and latency trade-offs might not be worth the savings.
Budget-Critical Scenario: gpt-5-nano
If cost is your absolute top priority and latency doesn't matter, gpt-5-nano is the cheapest option at $0.000612/query ($2,234/year at 10K queries/day).
The trade-offs:
13.5s average latency (4.4x slower than gpt-4o-mini)
1,357 output tokens average (7.6x more verbose)
Requires strict system prompts to control verbosity
Not suitable for real-time interactions
When gpt-5-nano makes sense:
Overnight batch processing (reports, analysis, bulk data formatting)
Internal tools where speed doesn't impact user experience
Extremely cost-constrained deployments ($507/year savings vs gpt-4o-mini)
When it doesn't:
Real-time agent interactions (13.5s is too slow)
Any workflow where users expect <5s response times
High-complexity queries where verbose outputs become unmanageable
My take: The $507/year savings isn't worth the operational complexity for most teams. Stick with gpt-4o-mini unless you have a specific batch workload where latency truly doesn't matter.
Never Use for Agents: gpt-5 and gpt-4.1
gpt-5: $41,292/year (15x more than gpt-4o-mini)
This is a fantastic model for reasoning tasks. It's terrible for agent workflows.
You're paying $38,551/year for capabilities you don't need. The extended thinking and thorough analysis are wasted on "call this tool and format the response."
When to use gpt-5: Complex reasoning, code generation with architectural decisions, exploratory analysis. Not agents.
gpt-4.1: $15,184/year (5.5x more than gpt-4o-mini)
Output token pricing ($8/1M) makes this uneconomical even with lean responses. There's no compelling reason to use gpt-4.1 over gpt-4o-mini for agent workflows.
When to use gpt-4.1: If you need slightly better reasoning than gpt-4o-mini for specific edge cases, you might route 5-10% of traffic to gpt-4.1. But default to gpt-4o-mini.
Cost Scaling Across Query Volumes
Annual cost comparison (10,000 queries/day):
| Model | Annual Cost | vs gpt-4o-mini |
|---|---|---|
| gpt-5-nano | $2,234 | -19% |
| gpt-4o-mini | $2,741 | baseline |
| gpt-5-mini | $7,377 | +169% |
| gpt-5.1 | $11,750 | +329% |
| gpt-4.1 | $15,184 | +454% |
| gpt-5 | $41,292 | +1,406% |
At 50,000 queries/day (mid-size enterprise):
| Model | Annual Cost | vs gpt-4o-mini |
|---|---|---|
| gpt-5-nano | $11,169 | -19% |
| gpt-4o-mini | $13,706 | baseline |
| gpt-5-mini | $36,883 | +169% |
| gpt-5.1 | $58,747 | +329% |
| gpt-4.1 | $75,920 | +454% |
| gpt-5 | $206,458 | +1,406% |
At this scale, choosing gpt-5 over gpt-4o-mini costs $192,752/year. That's two senior engineers' salaries.
At 100,000 queries/day (large enterprise):
| Model | Annual Cost | vs gpt-4o-mini |
|---|---|---|
| gpt-5-nano | $22,338 | -19% |
| gpt-4o-mini | $27,412 | baseline |
| gpt-5-mini | $73,767 | +169% |
| gpt-5.1 | $117,494 | +329% |
| gpt-4.1 | $151,840 | +454% |
| gpt-5 | $412,915 | +1,406% |
The cost difference becomes a line item on the P&L. CFOs will ask why you're spending $385K/year on LLM costs when competitors are spending $27K.
Key Takeaways
For Builders
Measure before committing to a model.
Don't assume the latest model is the best model. gpt-5 is incredible for reasoning tasks. It's wasteful for agent workflows.
Run your own tests. Query patterns, tool complexity, and system prompts all impact token usage. The numbers in this post are from my specific workflow—yours will differ. But the pattern (verbose models cost more) is universal.
Agent workflows need efficiency, not reasoning power.
Your agent's job is tool orchestration: select the right tool, execute it, format the response. That doesn't require extended thinking or deep analysis. It requires reliability, speed, and cost efficiency.
gpt-4o-mini handles agent workflows just as reliably as gpt-5, but 15x cheaper and 5x faster.
Output verbosity compounds at scale.
A 1,000-token difference in output per query seems small. At 10,000 queries/day, it's 10 million tokens/day = 300 million tokens/month. At $10/1M (gpt-5 output pricing), that's $3,000/month in unnecessary verbosity.
Watch your output token counts. If you're seeing 500+ tokens per response for simple queries, you're either using the wrong model or your system prompt needs tightening.
For Architects
Model selection impacts annual budget by 5-15x.
This isn't a minor optimization. Choosing gpt-5 over gpt-4o-mini for agent workflows can cost $38,551/year at moderate scale (10K queries/day). At enterprise scale (100K queries/day), the difference is $385,503/year.
Model choice belongs in architecture reviews, not buried in implementation details.
Consider latency requirements early.
Real-time agents need <5s response times. gpt-5's 16s latency is unacceptable for Slack bots, API endpoints, or monitoring alerts.
If your agent needs to respond in real-time, eliminate gpt-5 from consideration immediately. Test gpt-4o-mini, gpt-5.1, and maybe gpt-4.1 if you need slightly better reasoning.
Plan for caching early (it changes economics).
Prompt caching can reduce costs by 40-50% if you're caching system prompts and tool definitions. But caching benefits vary by model due to different cache discount rates.
Test caching strategies early in development, not after you've shipped to production with 100K queries/day.
For Engineering Leaders
$40K/year difference at moderate scale.
At 10,000 queries/day, gpt-5 costs $41,292/year. gpt-4o-mini costs $2,741/year. That's $38,551/year wasted on capabilities you're not using.
Ask your team: "What model are we using for agent workflows, and have we measured alternatives?" If the answer is "gpt-5" or "gpt-4," challenge it. If they haven't measured alternatives, make them.
Model choice is not a one-time decision.
OpenAI ships new models every quarter. Pricing changes. Your query patterns evolve. What was optimal six months ago might not be optimal today.
Build model selection into your quarterly reviews. Measure token costs per query. Compare models. Switch if the math improves.
Build measurement into your agent platform.
You can't optimize what you don't measure. Log input tokens, output tokens, cost per query, and latency for every request. Track these metrics over time.
If your monthly LLM invoice is growing but you don't know which queries are expensive or which models are wasteful, you're flying blind.
Instrument your agent platform from day one. Future you will thank present you.
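A minimal sketch of what that instrumentation can look like, assuming the OpenAI-style `usage` fields (`prompt_tokens`, `completion_tokens`) on the response object; the wrapper and rate constants are illustrative, not a specific library API:

```python
import time

# Per-1M-token rates; swap in your model's pricing (gpt-4o-mini shown).
IN_RATE, OUT_RATE = 0.40, 1.60

def measure_call(call, *args, **kwargs):
    """Wrap an LLM call; return (response, metrics dict).

    Assumes the response exposes OpenAI-style usage counts
    (response.usage.prompt_tokens / completion_tokens).
    """
    start = time.monotonic()
    response = call(*args, **kwargs)
    latency = time.monotonic() - start
    usage = response.usage
    return response, {
        "input_tokens": usage.prompt_tokens,
        "output_tokens": usage.completion_tokens,
        "cost": (usage.prompt_tokens * IN_RATE
                 + usage.completion_tokens * OUT_RATE) / 1_000_000,
        "latency_s": round(latency, 3),
    }
```

Ship the metrics dict to whatever sink you already use (logs, Prometheus, a database); the point is that every request gets a token count, a dollar figure, and a latency attached to it.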
What's Next
My next phase tests prompt caching. OpenAI claims 90% cost savings on cached inputs, with cache discounts varying by model (50-90% depending on the model).
I'm measuring:
Cache hit rates in production-like scenarios
Cost reduction across different models
Whether cached costs change the model selection math
If caching delivers on the promise, it might make gpt-5-mini viable for specific workloads. Or it might just make gpt-4o-mini even cheaper.
Next post: "Prompt Caching for AI Agents: Testing OpenAI's 90% Cost Reduction Claim"
Building an AI agent from scratch to understand token economics. All experiments, code, and data published as I go.






