OpenAI Prompt Caching: Undocumented Cross-Model Behavior and Production Cost Implications

I'm building an AI agent from scratch—no frameworks, no abstractions—specifically to understand where every token goes and how much it costs. This is Phase 3 of my token economics research.
Phase 1 covered basic tool calling mechanics. Phase 2 revealed how conversation history causes exponential token growth—adding two conversation turns tripled costs compared to adding five tools.
Phase 3 focuses on LLM-native optimizations: techniques built into the model provider's infrastructure.
First up: OpenAI's automatic prompt caching.
I tested prompt caching across gpt-4o-mini, gpt-5-mini, and gpt-5 with a 10-tool agent. The documented behavior worked as expected. But I also discovered something that isn't in OpenAI's documentation: cache sharing across model generations.
Here's what I measured, how I reproduced it, and when it matters.
How Prompt Caching Works
Every LLM call reprocesses your entire prompt from scratch. System instructions, tool definitions, conversation history—all of it gets tokenized and processed every single time.
Prompt caching changes this. Once your prompt prefix exceeds 1024 tokens, OpenAI automatically caches the processed representation. Subsequent calls with the same prefix reuse the cached computation.
What gets cached:
System message
Tool definitions (the tools array)
Initial messages in the conversation
What doesn't get cached:
New user messages
Assistant responses
Tool results
The cache is prefix-based. OpenAI identifies the longest matching prefix starting from the beginning of your prompt and caches it in 128-token increments after the first 1024 tokens.
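That increment rule can be turned into a quick back-of-the-envelope helper (my own sketch, not an OpenAI API; the 1024 threshold and 128-token step are the documented values):

```python
def cacheable_tokens(prompt_tokens: int) -> int:
    """Tokens eligible for caching: nothing below 1024, then the
    prefix is cached in 128-token increments beyond that."""
    if prompt_tokens < 1024:
        return 0
    return 1024 + ((prompt_tokens - 1024) // 128) * 128

print(cacheable_tokens(1360))  # -> 1280
print(cacheable_tokens(1444))  # -> 1408
```

Those two inputs match the prefix sizes in the tests below, which is why the cached-token counts come out to exactly 1,280 and 1,408.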
Cache retention:
Typical: 5-10 minutes of inactivity
Maximum: 1 hour
Organization-scoped (shared across API calls using the same key)
Discount structure:
gpt-4o-mini: 50% off cached input tokens
gpt-5-mini: 90% off cached input tokens
gpt-5: 90% off cached input tokens
The discount applies automatically. You don't need to change your API calls. The cached token count appears in response.usage.prompt_tokens_details.cached_tokens.
Caching is invisible until you log it. Most developers don't even know it's happening.
Test 1: Single Model Cache Behavior
I started by confirming the documented behavior. My test agent has 10 tools and an expanded system prompt totaling 1,360-1,444 tokens (depending on model tokenization).
I ran 10 identical queries per model, logging prompt_tokens and cached_tokens from each response.
Results:
| Model | Cache Hit Rate | Tokens Cached | Cost Reduction |
| --- | --- | --- | --- |
| gpt-4o-mini | 80% (8/10 runs) | 1,280/1,360 | 47% |
| gpt-5-mini | 90% (9/10 runs) | 1,408/1,444 | 49% |
| gpt-5 | 90% (9/10 runs) | 1,408/1,444 | 49% |
The first call is always a cache miss—nothing is cached yet. Subsequent calls hit the cache 80-90% of the time. The misses are probabilistic (server routing, cache eviction).
Code to log cached tokens:
```python
response = client.chat.completions.create(
    model="gpt-5-mini",
    messages=[...],
    tools=[...]
)

prompt_tokens = response.usage.prompt_tokens
cached_tokens = response.usage.prompt_tokens_details.cached_tokens
cache_percent = (cached_tokens / prompt_tokens * 100) if prompt_tokens > 0 else 0
print(f"Cached: {cached_tokens}/{prompt_tokens} ({cache_percent:.1f}%)")
```
The 47-49% cost reduction is real. For sustained workloads with repeated prefixes, this is automatic savings with zero code changes.
Test 2: Tool Definition Tokenization
Before running the cache tests, I needed to expand my prefix above the 1024-token threshold. I started with 6 tools (~900 tokens). Adding 4 more tools should have pushed me well over.
I estimated ~400-500 additional tokens based on the JSON size.
Actual result: 56 tokens.
The raw JSON for 10 tool definitions is 6,200 characters. Using a naive estimate of 4 characters per token gives ~1,550 tokens. OpenAI reported 956 tokens for the tools alone.
OpenAI is clearly doing aggressive compression on function schemas. Fields like type, properties, required, additionalProperties likely have special handling—they're repeated across every tool definition.
Implication: Don't avoid adding tools because you're worried about token costs. The overhead is far lower than you'd calculate from JSON character count. My 4 new tools added only 14 tokens each on average.
This matters when you're deciding between one complex tool that handles multiple cases versus multiple specialized tools. The token cost of splitting tools is minimal.
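The gap between the naive estimate and the billed count is easy to see with the numbers above (a trivial sketch; the 4 chars/token heuristic is the common rule of thumb, not an OpenAI constant):

```python
def naive_token_estimate(json_chars: int, chars_per_token: int = 4) -> int:
    """Rough token estimate from raw character count."""
    return json_chars // chars_per_token

estimate = naive_token_estimate(6200)  # raw JSON for 10 tool definitions
reported = 956                         # what the API actually billed for tools
print(f"naive estimate: {estimate}, reported: {reported}")
```

The naive estimate comes out around 1,550 tokens, roughly 60% higher than what the API reported, which is the compression gap described above.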
Test 3: Cross-Model Cache Sharing
This is the interesting part.
I wanted to know: does the cache persist across model boundaries? If I call gpt-4o-mini first, will gpt-5-mini benefit from its warm cache?
Test Design:
I ran two phases with three model orderings each:
Phase 1: Same prefix for all models
Order A: gpt-4o-mini → gpt-5-mini → gpt-5
Order B: gpt-5-mini → gpt-5 → gpt-4o-mini
Order C: gpt-5 → gpt-4o-mini → gpt-5-mini
Expected behavior: Model 1 gets cache miss (cold start). Models 2 and 3 get cache hits.
Phase 2: Different prefix for Model 1, same for Models 2-3
Same orderings
Model 1 uses a shortened system prompt (different prefix)
Models 2 and 3 use the full standard prompt
Expected behavior: Model 1 gets cache miss (different prompt). Models 2 and 3 get cache hits from each other.
I waited 10 seconds between orderings to let cache state settle. I waited 5 seconds between models within each ordering.
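The test design above can be sketched as a small harness; call_fn is a stand-in for the actual API call and returns that call's cached_tokens, since the real request logic isn't shown here:

```python
import time

MODELS = ["gpt-4o-mini", "gpt-5-mini", "gpt-5"]

def orderings(models):
    """The three rotations used as orders A, B, and C."""
    return [models[i:] + models[:i] for i in range(len(models))]

def run_phase(call_fn, settle_s=10, between_s=5):
    """Run one phase: every ordering, logging HIT/MISS per model."""
    results = []
    for order in orderings(MODELS):
        hits = []
        for model in order:
            cached = call_fn(model)
            hits.append("HIT" if cached > 0 else "MISS")
            time.sleep(between_s)   # wait between models within an ordering
        results.append((order, hits))
        time.sleep(settle_s)        # let cache state settle between orderings
    return results
```

Rotating the model list reproduces orders A, B, and C exactly as listed above.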
Results - Phase 1:
| Order | Model 1 | Model 2 | Model 3 |
| --- | --- | --- | --- |
| A (4o→5m→5) | MISS | HIT | MISS |
| B (5m→5→4o) | HIT | HIT | HIT |
| C (5→4o→5m) | HIT | HIT | HIT |
Order A is the clean proof. gpt-4o-mini runs first with a cold cache. gpt-5-mini immediately gets a cache hit. The only explanation: gpt-5-mini reused the cache warmed by gpt-4o-mini.
Orders B and C show Model 1 hitting cache—this is because the cache from Order A hadn't evicted yet. But the key finding is in Order A.
Results - Phase 2:
| Order | Model 1 (diff) | Model 2 (std) | Model 3 (std) |
| --- | --- | --- | --- |
| A (4o→5m→5) | MISS | HIT | MISS |
| B (5m→5→4o) | MISS | MISS | HIT |
| C (5→4o→5m) | MISS | MISS | HIT |
Again, Order A proves the point. Model 1 (gpt-4o-mini) uses a different prefix—cache miss. Model 2 (gpt-5-mini) uses the standard prefix and gets a cache hit from... where? Model 1 didn't cache the standard prefix.
The answer: gpt-5-mini is hitting the cache from Phase 1, Order A. The cache persisted for ~2 minutes between phases.

gpt-4o-mini call (cached_tokens: 0)
        |
        |  cache writes prefix
        v
gpt-5-mini call (cached_tokens: 1408)   <- same prefix, different model, cache hit
The pattern is consistent across both phases. When gpt-4o-mini runs first, gpt-5-mini benefits from its cache.
What's Actually Being Shared
Before someone pedantically corrects me: this is prefix-processing cache sharing, not KV-cache sharing.
The models share:
Tokenization pipeline
Prefix normalization
Cache key hashing
They do not share transformer attention states. That's architecturally impossible—gpt-4o-mini and gpt-5 have different layer counts, hidden dimensions, and weight matrices. Their KV caches are mathematically incompatible.
What OpenAI has built is a shared prefix-processing layer that sits in front of the model-specific forward pass. When you call gpt-5-mini after gpt-4o-mini with the same prefix, the prefix-processing layer says "I've already tokenized and normalized this 1,400-token prefix—here it is" and hands it to gpt-5-mini's model.
From a billing perspective, it doesn't matter. Cached tokens are cached tokens. The 90% discount applies either way.
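As a mental model only (this is illustrative, not OpenAI's actual implementation), a shared prefix-processing cache might look like a store keyed on the prefix content but not the model name:

```python
import hashlib
import json

class SharedPrefixCache:
    """Illustrative: the cache key depends on prefix bytes, not on model."""
    def __init__(self):
        self._store = {}

    def key(self, messages, tools):
        payload = json.dumps({"messages": messages, "tools": tools},
                             sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def lookup(self, messages, tools):
        return self._store.get(self.key(messages, tools))

    def warm(self, messages, tools, processed):
        self._store[self.key(messages, tools)] = processed

cache = SharedPrefixCache()
prefix = [{"role": "system", "content": "1400-token prompt..."}]
cache.warm(prefix, [], "processed-prefix")  # the gpt-4o-mini call warms it
print(cache.lookup(prefix, []))             # a later gpt-5-mini call hits the same entry
```

Because the key hashes the exact bytes of the prefix, any single-character change produces a different key, which is consistent with the byte-for-byte prefix matching caveat later in this post.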
Why gpt-5 showed inconsistency:
In both Order A tests, gpt-5 missed cache even though gpt-5-mini hit it. I ran this multiple times—the pattern held. gpt-5 is less consistent at hitting shared cache.
My hypothesis: gpt-5 is a reasoning model with different prefix handling. It may do additional processing on the prefix that breaks cache key matching. Or it routes to different servers. I don't have enough data to say definitively, but gpt-5-mini is the most reliable for cross-model cache benefits.
Production Cost Implications
Cross-model cache sharing matters when you have high cold-start rates. If your cache stays warm naturally (sustained traffic, same prefix), cross-model warming adds minimal value.
But if you're starting many separate sessions, the savings compound fast.
Scenario: 1,000 cold starts per day
Assume:
10,000 token system prompt (large tool set, detailed instructions)
1,000 separate user sessions per day (different contexts, each needs cache warmup)
Primary model: gpt-5 ($1.25/1M input tokens)
Without cross-model warming:
Each session's first call pays the full 10K token cost:
Per session: 10,000 tokens × $1.25/1M = $0.0125
Daily: 1,000 × $0.0125 = $12.50
Annual: $4,562
With gpt-5-nano warming first:
Each session warms with gpt-5-nano ($0.05/1M input tokens), then calls gpt-5:
Nano warmup: 10,000 tokens × $0.05/1M = $0.0005
gpt-5 call: 10,000 tokens × $0.125/1M (90% cached) = $0.00125
Total per session: $0.00175
Daily: 1,000 × $0.00175 = $1.75
Annual: $639
Savings: $3,923/year (86% reduction on warmup costs)
Scale this to gpt-5-pro ($15/1M input tokens):
Without warming: $54,750/year
With nano warming: $5,658/year (the warmed gpt-5-pro call still pays the 10% cached rate of $1.50/1M)
Savings: $49,092/year
Scale to 100,000 calls/day with the same 10K prefix (target gpt-5):
Without warming: $456,250/year
With nano warming: $63,875/year
Savings: $392,375/year
Cost Comparison Table
| Calls/Day | Target Model | Without Warming | With Nano Warming | Annual Savings |
| --- | --- | --- | --- | --- |
| 1,000 | gpt-5 | $4,562 | $639 | $3,923 |
| 1,000 | gpt-5-pro | $54,750 | $5,658 | $49,092 |
| 100,000 | gpt-5 | $456,250 | $63,875 | $392,375 |
These numbers assume every call is a cold start. In practice, you'll have some natural cache retention. But the principle holds: for systems with high session turnover, explicit cache warming with a cheap model saves real money.
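The scenario arithmetic can be reproduced with a small cost model (prices and the 90% cached discount are the assumptions stated above):

```python
def annual_cost(calls_per_day, prefix_tokens, price_per_m,
                warm_price_per_m=None, cached_discount=0.9):
    """Annual cold-start prefix cost; optionally with a cheap warmup call.

    price_per_m: target model's input price per 1M tokens.
    warm_price_per_m: warmup model's input price, or None for no warming.
    """
    if warm_price_per_m is None:
        per_call = prefix_tokens * price_per_m / 1e6
    else:
        warmup = prefix_tokens * warm_price_per_m / 1e6
        cached = prefix_tokens * price_per_m * (1 - cached_discount) / 1e6
        per_call = warmup + cached
    return per_call * calls_per_day * 365

print(annual_cost(1000, 10_000, 1.25))        # gpt-5, no warming (~$4,562)
print(annual_cost(1000, 10_000, 1.25, 0.05))  # with nano warming (~$639)
```

Swapping in other prices or call volumes reproduces the rest of the table.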
When this matters:
High cold-start rate (many separate sessions/contexts per day)
Large prefixes (10K+ tokens)
Expensive target model (gpt-5, gpt-5-pro)
Cost-sensitive production systems
When this doesn't matter:
Sustained single-model traffic (cache stays warm naturally)
Small prefixes (<2K tokens—savings too small vs added latency)
Latency-critical paths (extra API call adds 100-500ms)
Implementation Strategy
The simplest approach: call the cheap model first, wait for the response (confirms cache is warm), then call the expensive model.
Pseudocode:
```python
def warm_then_call(prefix_messages, tools, user_message, target_model="gpt-5"):
    """
    Warm cache with a cheap model, then call the expensive model.
    """
    # Warm cache with gpt-5-nano. We don't care about the output, just warming,
    # so cap completion length at 1 token.
    warmup_response = client.chat.completions.create(
        model="gpt-5-nano",
        messages=prefix_messages,
        tools=tools,
        max_completion_tokens=1
    )
    # Confirm cache was created
    # (In production, you'd log warmup_response.usage here for monitoring)

    # Now call the target model - it should hit the warm cache
    response = client.chat.completions.create(
        model=target_model,
        messages=prefix_messages + [user_message],  # add the user query
        tools=tools
    )
    # Check whether a cache hit occurred
    cached = response.usage.prompt_tokens_details.cached_tokens
    total = response.usage.prompt_tokens
    print(f"Cache hit: {cached}/{total} tokens")
    return response
```
Tradeoffs:
Adding a warmup call costs:
Extra API call (nano is cheap but not free)
Added latency (100-500ms for the warmup call)
The latency matters. For interactive user-facing applications, an extra 200ms is noticeable. For batch processing or background jobs, it's irrelevant.
When nano-first makes sense:
Prefix > 5K tokens (savings outweigh warmup cost)
Target model is expensive (gpt-5, gpt-5-pro)
Latency tolerance > 200ms
When it doesn't:
Small prefixes (< 2K tokens—warmup cost ≈ savings)
Latency-critical paths
Sustained traffic (cache stays warm anyway)
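Folding those criteria into a decision helper might look like this (the thresholds are the rough guidelines from this section, not hard rules):

```python
def should_warm(prefix_tokens: int, latency_budget_ms: int,
                sustained_traffic: bool) -> bool:
    """Rough heuristic: is a nano-first warmup call worth it?"""
    if sustained_traffic:           # cache stays warm anyway
        return False
    if prefix_tokens < 5000:        # below this, warmup cost ~ savings
        return False
    return latency_budget_ms > 200  # warmup adds roughly 100-500 ms

print(should_warm(10_000, 500, sustained_traffic=False))  # True
print(should_warm(1_500, 500, sustained_traffic=False))   # False
```

Tune the thresholds against your own measured hit rates and prices rather than taking these defaults as given.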
Monitoring:
Track cached_tokens in your logs. Calculate cache hit rate:
cache_hit_rate = cached_calls / total_calls
If you're seeing < 50% hit rate, investigate:
Is your prefix changing between calls?
Are you exceeding cache retention time (5-10 min idle)?
Is traffic bursty enough that cache evicts between calls?
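A minimal monitor for this might look like the following sketch; the usage field names match the response object used earlier:

```python
class CacheMonitor:
    """Accumulates per-call usage to compute cache hit rate."""
    def __init__(self):
        self.total_calls = 0
        self.cached_calls = 0
        self.cached_tokens = 0
        self.prompt_tokens = 0

    def record(self, usage):
        """Call with response.usage after every API call."""
        self.total_calls += 1
        cached = usage.prompt_tokens_details.cached_tokens
        if cached > 0:
            self.cached_calls += 1
        self.cached_tokens += cached
        self.prompt_tokens += usage.prompt_tokens

    @property
    def hit_rate(self):
        return self.cached_calls / self.total_calls if self.total_calls else 0.0
```

Feed it every response's usage object and alert when hit_rate drops below your expected baseline.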
Limitations and Caveats
This behavior is not officially documented. OpenAI's docs mention prompt caching but don't specify cross-model sharing. I discovered it empirically.
What this means:
Behavior could change without notice
OpenAI might intentionally disable cross-model sharing
Future model releases might not share the same pipeline
Other limitations:
Cache eviction is unpredictable. The 5-10 minute guideline is approximate. During high load, caches evict faster. During low load, they persist longer.
Hit rate is probabilistic. I saw 80-90% in tests, not 100%. Server routing, load balancing, and cache state all affect whether you hit cache.
Organization-scoped. Different organizations never share cache. My tests all ran under a single API key, so I can't confirm from my own data how reliably separate keys within one organization share.
Byte-for-byte prefix matching. A single character difference in your system prompt breaks the cache. Even whitespace matters.
Extra API call adds latency. Nano is fast, but it's still a round trip. For latency-sensitive paths, this may outweigh cost savings.
gpt-5 showed lower consistency. In my tests, gpt-5 missed cache more often than gpt-5-mini. If your target model is gpt-5, test thoroughly before assuming reliable cache hits.
Treat this as an optimization for specific workloads, not a universal best practice. Measure your own hit rates before committing to a warmup strategy.
Reproduction Steps
If you want to verify this yourself:
Requirements:
OpenAI API key
System prompt + tools totaling > 1024 tokens
Test procedure:
Create a prompt with at least 1024 tokens. Use a detailed system message or add several tool definitions.
Call gpt-4o-mini three times with an identical prefix. Log cached_tokens from each response.
Wait 5 seconds.
Call gpt-5-mini with the same prefix. Check cached_tokens on the first call.
If cached_tokens > 0 on gpt-5-mini's first call, you've confirmed cross-model cache sharing.
Minimal test script:
```python
import openai
import time

client = openai.OpenAI(api_key="your-key")

messages = [
    {"role": "system", "content": "Your 1024+ token system prompt here..."},
    {"role": "user", "content": "Test query"}
]
tools = [...]  # Your tool definitions

# Call 1: gpt-4o-mini
response1 = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages,
    tools=tools
)
print(f"4o-mini: {response1.usage.prompt_tokens_details.cached_tokens} cached")

time.sleep(5)

# Call 2: gpt-5-mini
response2 = client.chat.completions.create(
    model="gpt-5-mini",
    messages=messages,
    tools=tools
)
print(f"5-mini: {response2.usage.prompt_tokens_details.cached_tokens} cached")
```
Expected output:
4o-mini: 0 cached
5-mini: 1408 cached
If gpt-5-mini shows cached tokens on its first call, you've reproduced the finding.
Key Takeaways
Cross-model cache sharing exists. It's not documented, but it's measurable and reproducible. gpt-4o-mini, gpt-5-mini, and gpt-5 share a prefix-processing cache at the organization level.
The cost impact scales with cold starts. For sustained traffic with natural cache warmth, cross-model warming adds little. For high session turnover (1,000+ cold starts/day), explicit nano-warming can save $4K-$400K/year depending on target model and prefix size.
Tool definitions are heavily compressed. Don't avoid adding tools for token concerns. OpenAI's schema compression means the overhead is far lower than JSON character count suggests.
Measurement beats assumption. Token economics requires logging every call, tracking cached_tokens, and calculating actual costs. The only way to know if an optimization works is to measure it in your specific workload.
This is Phase 3 of ongoing research. Next up: structured outputs (eliminating retry loops), reasoning effort control (gpt-5 token/quality tradeoff), and batch API (50% cost reduction with 24-hour latency). Each technique gets tested with real numbers, not theory.
If you're building production LLM systems, log your token usage. The optimizations aren't obvious until you see where the tokens actually go.
Building this agent from scratch—no frameworks, full visibility—specifically to understand token costs at every layer. All experiments, code, and data published as I go.






