
OpenAI Prompt Caching: Undocumented Cross-Model Behavior and Production Cost Implications


I'm building an AI agent from scratch—no frameworks, no abstractions—specifically to understand where every token goes and how much it costs. This is Phase 3 of my token economics research.

Phase 1 covered basic tool calling mechanics. Phase 2 revealed how conversation history causes exponential token growth—adding two conversation turns tripled costs compared to adding five tools.

Phase 3 focuses on LLM-native optimizations: techniques built into the model provider's infrastructure.

First up: OpenAI's automatic prompt caching.

I tested prompt caching across gpt-4o-mini, gpt-5-mini, and gpt-5 with a 10-tool agent. The documented behavior worked as expected. But I also discovered something that isn't in OpenAI's documentation: cache sharing across model generations.

Here's what I measured, how I reproduced it, and when it matters.


How Prompt Caching Works

Every LLM call reprocesses your entire prompt from scratch. System instructions, tool definitions, conversation history—all of it gets tokenized and processed every single time.

Prompt caching changes this. Once your prompt prefix reaches 1024 tokens, OpenAI automatically caches the processed representation. Subsequent calls with the same prefix reuse the cached computation.

What gets cached:

  • System message

  • Tool definitions (the tools array)

  • Initial messages in the conversation

What doesn't get cached:

  • New user messages

  • Assistant responses

  • Tool results

The cache is prefix-based. OpenAI identifies the longest matching prefix starting from the beginning of your prompt and caches it in 128-token increments after the first 1024 tokens.
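The increment rule can be sketched as a small helper. This is my own formulation of the documented behavior, not an official formula:

```python
def cacheable_tokens(prefix_tokens: int) -> int:
    """Largest cacheable prefix: nothing below 1024 tokens,
    then whole 128-token increments beyond the first 1024."""
    if prefix_tokens < 1024:
        return 0
    return 1024 + ((prefix_tokens - 1024) // 128) * 128

print(cacheable_tokens(1360))  # 1280 -- matches the gpt-4o-mini measurement below
print(cacheable_tokens(1444))  # 1408 -- matches the gpt-5-mini/gpt-5 measurements
```

The two example values line up with the cached-token counts measured later in this post, which is a useful sanity check that the increment rule is what's actually running.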

Cache retention:

  • Typical: 5-10 minutes of inactivity

  • Maximum: 1 hour

  • Organization-scoped (shared across API calls using the same key)

Discount structure:

  • gpt-4o-mini: 50% off cached input tokens

  • gpt-5-mini: 90% off cached input tokens

  • gpt-5: 90% off cached input tokens

The discount applies automatically. You don't need to change your API calls. The cached token count appears in response.usage.prompt_tokens_details.cached_tokens.

Caching is invisible until you log it. Most developers don't even know it's happening.


Test 1: Single Model Cache Behavior

I started by confirming the documented behavior. My test agent has 10 tools and an expanded system prompt totaling 1,360-1,444 tokens (depending on model tokenization).

I ran 10 identical queries per model, logging prompt_tokens and cached_tokens from each response.

Results:

| Model | Cache Hit Rate | Tokens Cached | Cost Reduction |
|---|---|---|---|
| gpt-4o-mini | 80% (8/10 runs) | 1,280/1,360 | 47% |
| gpt-5-mini | 90% (9/10 runs) | 1,408/1,444 | 49% |
| gpt-5 | 90% (9/10 runs) | 1,408/1,444 | 49% |

The first call is always a cache miss—nothing is cached yet. Subsequent calls hit the cache 80-90% of the time. The misses are probabilistic (server routing, cache eviction).

Code to log cached tokens:

response = client.chat.completions.create(
    model="gpt-5-mini",
    messages=[...],
    tools=[...]
)

prompt_tokens = response.usage.prompt_tokens
cached_tokens = response.usage.prompt_tokens_details.cached_tokens
cache_percent = (cached_tokens / prompt_tokens * 100) if prompt_tokens > 0 else 0

print(f"Cached: {cached_tokens}/{prompt_tokens} ({cache_percent:.1f}%)")

The 47-49% cost reduction is real. For sustained workloads with repeated prefixes, this is automatic savings with zero code changes.
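As a sanity check on the gpt-4o-mini row: 1,280 of 1,360 prompt tokens cached at a 50% discount works out to exactly the 47% input-cost reduction in the table. A minimal sketch (`input_cost_reduction` is my own helper name):

```python
def input_cost_reduction(cached: int, total: int, discount: float) -> float:
    """Fraction saved on input tokens when `cached` of `total` prompt
    tokens are billed at (1 - discount) of the normal rate."""
    return cached / total * discount

# gpt-4o-mini: 1,280/1,360 tokens cached at 50% off
print(f"{input_cost_reduction(1280, 1360, 0.50):.0%}")  # 47%
```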


Test 2: Tool Definition Tokenization

Before running the cache tests, I needed to expand my prefix above the 1024-token threshold. I started with 6 tools (~900 tokens). Adding 4 more tools should have pushed me well over.

I estimated ~400-500 additional tokens based on the JSON size.

Actual result: 56 tokens.

The raw JSON for 10 tool definitions is 6,200 characters. Using a naive estimate of 4 characters per token gives ~1,550 tokens. OpenAI reported 956 tokens for the tools alone.

OpenAI is clearly doing aggressive compression on function schemas. Fields like type, properties, required, additionalProperties likely have special handling—they're repeated across every tool definition.

Implication: Don't avoid adding tools because you're worried about token costs. The overhead is far lower than you'd calculate from JSON character count. My 4 new tools added only 14 tokens each on average.

This matters when you're deciding between one complex tool that handles multiple cases versus multiple specialized tools. The token cost of splitting tools is minimal.
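Working backward from the figures above, tool schemas tokenize at roughly 6.5 characters per token rather than the naive 4. A quick arithmetic check using the measured numbers:

```python
json_chars = 6200      # raw JSON for the 10 tool definitions
reported_tokens = 956  # what OpenAI billed for the tools array

naive_estimate = json_chars / 4               # the usual 4-chars-per-token guess
chars_per_token = json_chars / reported_tokens

print(round(naive_estimate))      # 1550
print(round(chars_per_token, 1))  # 6.5
```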


Test 3: Cross-Model Cache Sharing

This is the interesting part.

I wanted to know: does the cache persist across model boundaries? If I call gpt-4o-mini first, will gpt-5-mini benefit from its warm cache?

Test Design:

I ran two phases with three model orderings each:

Phase 1: Same prefix for all models

  • Order A: gpt-4o-mini → gpt-5-mini → gpt-5

  • Order B: gpt-5-mini → gpt-5 → gpt-4o-mini

  • Order C: gpt-5 → gpt-4o-mini → gpt-5-mini

Expected behavior: Model 1 gets cache miss (cold start). Models 2 and 3 get cache hits.

Phase 2: Different prefix for Model 1, same for Models 2-3

  • Same orderings

  • Model 1 uses a shortened system prompt (different prefix)

  • Models 2 and 3 use the full standard prompt

Expected behavior: Model 1 gets cache miss (different prompt). Models 2 and 3 get cache hits from each other.

I waited 10 seconds between orderings to let cache state settle. I waited 5 seconds between models within each ordering.

Results - Phase 1:

| Order | Model 1 | Model 2 | Model 3 |
|---|---|---|---|
| A (4o → 5m → 5) | MISS | HIT | MISS |
| B (5m → 5 → 4o) | HIT | HIT | HIT |
| C (5 → 4o → 5m) | HIT | HIT | HIT |

Order A is the clean proof. gpt-4o-mini runs first with a cold cache. gpt-5-mini immediately gets a cache hit. The only explanation: gpt-5-mini reused the cache warmed by gpt-4o-mini.

Orders B and C show Model 1 hitting the cache because the cache from Order A hadn't been evicted yet. The key finding is in Order A.

Results - Phase 2:

| Order | Model 1 (diff) | Model 2 (std) | Model 3 (std) |
|---|---|---|---|
| A (4o → 5m → 5) | MISS | HIT | MISS |
| B (5m → 5 → 4o) | MISS | MISS | HIT |
| C (5 → 4o → 5m) | MISS | MISS | HIT |

Again, Order A proves the point. Model 1 (gpt-4o-mini) uses a different prefix—cache miss. Model 2 (gpt-5-mini) uses the standard prefix and gets a cache hit from... where? Model 1 didn't cache the standard prefix.

The answer: gpt-5-mini is hitting the cache from Phase 1, Order A. The cache persisted for ~2 minutes between phases.

In short: gpt-4o-mini call (cached_tokens: 0) → cache stores the prefix → gpt-5-mini call (cached_tokens: 1,408). Same prefix, different model, cache hit.

The pattern is consistent across both phases. When gpt-4o-mini runs first, gpt-5-mini benefits from its cache.


What's Actually Being Shared

Before someone pedantically corrects me: this is prefix-processing cache sharing, not KV-cache sharing.

The models share:

  • Tokenization pipeline

  • Prefix normalization

  • Cache key hashing

They do not share transformer attention states. That's architecturally impossible—gpt-4o-mini and gpt-5 have different layer counts, hidden dimensions, and weight matrices. Their KV caches are mathematically incompatible.

What OpenAI has built is a shared prefix-processing layer that sits in front of the model-specific forward pass. When you call gpt-5-mini after gpt-4o-mini with the same prefix, that layer recognizes the already-tokenized, already-normalized 1,400-token prefix and hands the cached result to gpt-5-mini's forward pass.

From a billing perspective, it doesn't matter. Cached tokens are cached tokens. The 90% discount applies either way.

Why gpt-5 showed inconsistency:

In both Order A tests, gpt-5 missed cache even though gpt-5-mini hit it. I ran this multiple times—the pattern held. gpt-5 is less consistent at hitting shared cache.

My hypothesis: gpt-5 is a reasoning model with different prefix handling. It may do additional processing on the prefix that breaks cache key matching. Or it routes to different servers. I don't have enough data to say definitively, but gpt-5-mini is the most reliable for cross-model cache benefits.


Production Cost Implications

Cross-model cache sharing matters when you have high cold-start rates. If your cache stays warm naturally (sustained traffic, same prefix), cross-model warming adds minimal value.

But if you're starting many separate sessions, the savings compound fast.

Scenario: 1,000 cold starts per day

Assume:

  • 10,000 token system prompt (large tool set, detailed instructions)

  • 1,000 separate user sessions per day (different contexts, each needs cache warmup)

  • Primary model: gpt-5 ($1.25/1M input tokens)

Without cross-model warming:

Each session's first call pays the full 10K token cost:

  • Per session: 10,000 tokens × $1.25/1M = $0.0125

  • Daily: 1,000 × $0.0125 = $12.50

  • Annual: $4,562

With gpt-5-nano warming first:

Each session warms with gpt-5-nano ($0.05/1M input tokens), then calls gpt-5:

  • Nano warmup: 10,000 tokens × $0.05/1M = $0.0005

  • gpt-5 call: 10,000 tokens × $0.125/1M (rate after the 90% cache discount) = $0.00125

  • Total per session: $0.00175

  • Daily: 1,000 × $0.00175 = $1.75

  • Annual: $639

Savings: $3,923/year (86% reduction on warmup costs)

Scale this to gpt-5-pro ($15/1M input tokens), assuming the same 90% cache discount applies:

  • Without warming: $54,750/year

  • With nano warming: $5,658/year (the cached gpt-5-pro calls still cost 10,000 × $1.50/1M = $0.015 per session, plus the nano warmup)

  • Savings: $49,093/year

Scale to 100,000 calls/day with the same 10K prefix:

  • Without warming: $456,250/year

  • With nano warming: $63,875/year

  • Savings: $392,375/year

Cost Comparison Table

| Calls/Day | Target Model | Without Warming | With Nano Warming | Annual Savings |
|---|---|---|---|---|
| 1,000 | gpt-5 | $4,562 | $639 | $3,923 |
| 1,000 | gpt-5-pro | $54,750 | $5,658 | $49,093 |
| 100,000 | gpt-5 | $456,250 | $63,875 | $392,375 |

These numbers assume every call is a cold start. In practice, you'll have some natural cache retention. But the principle holds: for systems with high session turnover, explicit cache warming with a cheap model saves real money.
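The gpt-5 rows can be reproduced with a small cost model. This is my own sketch (`annual_cost` is a made-up helper); prices are per million input tokens, and the warmed path assumes one gpt-5-nano call plus one 90%-cached target call per session:

```python
def annual_cost(calls_per_day, prefix_tokens, price_per_m,
                warm=False, nano_price_per_m=0.05, cache_discount=0.90):
    """Yearly cost of processing the prompt prefix, with or without
    a gpt-5-nano warmup call before each target-model call."""
    m = 1_000_000
    if warm:
        per_call = (prefix_tokens * nano_price_per_m / m                       # warmup call
                    + prefix_tokens * price_per_m * (1 - cache_discount) / m)  # cached target call
    else:
        per_call = prefix_tokens * price_per_m / m                             # full-price cold start
    return per_call * calls_per_day * 365

print(f"{annual_cost(1_000, 10_000, 1.25):.2f}")             # 4562.50 (cold starts)
print(f"{annual_cost(1_000, 10_000, 1.25, warm=True):.2f}")  # 638.75 (nano warming)
```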

When this matters:

  • High cold-start rate (many separate sessions/contexts per day)

  • Large prefixes (10K+ tokens)

  • Expensive target model (gpt-5, gpt-5-pro)

  • Cost-sensitive production systems

When this doesn't matter:

  • Sustained single-model traffic (cache stays warm naturally)

  • Small prefixes (<2K tokens—savings too small vs added latency)

  • Latency-critical paths (extra API call adds 100-500ms)


Implementation Strategy

The simplest approach: call the cheap model first, wait for the response (confirms cache is warm), then call the expensive model.

Pseudocode:

def warm_then_call(client, prefix_messages, user_message, tools, target_model="gpt-5"):
    """
    Warm cache with cheap model, then call expensive model.
    """
    # Warm cache with gpt-5-nano; the output itself is thrown away
    warmup_response = client.chat.completions.create(
        model="gpt-5-nano",
        messages=prefix_messages,
        tools=tools,
        max_completion_tokens=1  # gpt-5-family models take max_completion_tokens, not max_tokens
    )

    # Confirm cache was created
    # (In production, log warmup_response.usage here for monitoring)

    # Now call target model - should hit warm cache
    response = client.chat.completions.create(
        model=target_model,
        messages=prefix_messages + [user_message],  # append the user query
        tools=tools
    )

    # Check if cache hit occurred
    cached = response.usage.prompt_tokens_details.cached_tokens
    total = response.usage.prompt_tokens
    print(f"Cache hit: {cached}/{total} tokens")

    return response

Tradeoffs:

Adding a warmup call costs:

  • Extra API call (nano is cheap but not free)

  • Added latency (100-500ms for the warmup call)

The latency matters. For interactive user-facing applications, an extra 200ms is noticeable. For batch processing or background jobs, it's irrelevant.
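On raw token cost alone, the tradeoff can be sanity-checked with a break-even comparison; `warm_pays_off` and its default rates are my own sketch (latency, as noted, is a separate constraint it doesn't capture):

```python
def warm_pays_off(prefix_tokens, target_price_per_m,
                  nano_price_per_m=0.05, cache_discount=0.90, hit_rate=0.9):
    """True if expected cached-token savings on the target call
    exceed the cost of the nano warmup call."""
    warmup_cost = prefix_tokens * nano_price_per_m / 1_000_000
    expected_savings = (prefix_tokens * target_price_per_m / 1_000_000
                        * cache_discount * hit_rate)
    return expected_savings > warmup_cost

print(warm_pays_off(10_000, 1.25))  # True: 10K prefix, gpt-5 target
print(warm_pays_off(10_000, 0.05))  # False: target is as cheap as the warmer
```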

When nano-first makes sense:

  • Prefix > 5K tokens (savings outweigh warmup cost)

  • Target model is expensive (gpt-5, gpt-5-pro)

  • Latency tolerance > 200ms

When it doesn't:

  • Small prefixes (< 2K tokens—warmup cost ≈ savings)

  • Latency-critical paths

  • Sustained traffic (cache stays warm anyway)

Monitoring:

Track cached_tokens in your logs. Calculate cache hit rate:

cache_hit_rate = cached_calls / total_calls

If you're seeing < 50% hit rate, investigate:

  • Is your prefix changing between calls?

  • Are you exceeding cache retention time (5-10 min idle)?

  • Is traffic bursty enough that cache evicts between calls?
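The hit-rate calculation can be sketched over logged usage records. The field names mirror the API's usage object; the sample log itself is made up:

```python
def cache_hit_rate(usage_log):
    """Fraction of calls that hit the cache (any cached tokens at all)."""
    hits = sum(1 for u in usage_log if u["cached_tokens"] > 0)
    return hits / len(usage_log) if usage_log else 0.0

log = [
    {"prompt_tokens": 1360, "cached_tokens": 0},     # cold start
    {"prompt_tokens": 1360, "cached_tokens": 1280},  # hit
    {"prompt_tokens": 1360, "cached_tokens": 1280},  # hit
    {"prompt_tokens": 1360, "cached_tokens": 0},     # eviction or routing miss
]
print(cache_hit_rate(log))  # 0.5
```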


Limitations and Caveats

This behavior is not officially documented. OpenAI's docs mention prompt caching but don't specify cross-model sharing. I discovered it empirically.

What this means:

  • Behavior could change without notice

  • OpenAI might intentionally disable cross-model sharing

  • Future model releases might not share the same pipeline

Other limitations:

  1. Cache eviction is unpredictable. The 5-10 minute guideline is approximate. During high load, caches evict faster. During low load, they persist longer.

  2. Hit rate is probabilistic. I saw 80-90% in tests, not 100%. Server routing, load balancing, and cache state all affect whether you hit cache.

  3. Organization-scoped. Cache is tied to your API key. Different organizations don't share cache (obviously), but even different keys within the same org won't share.

  4. Byte-for-byte prefix matching. A single character difference in your system prompt breaks the cache. Even whitespace matters.

  5. Extra API call adds latency. Nano is fast, but it's still a round trip. For latency-sensitive paths, this may outweigh cost savings.

  6. gpt-5 showed lower consistency. In my tests, gpt-5 missed cache more often than gpt-5-mini. If your target model is gpt-5, test thoroughly before assuming reliable cache hits.

Treat this as an optimization for specific workloads, not a universal best practice. Measure your own hit rates before committing to a warmup strategy.


Reproduction Steps

If you want to verify this yourself:

Requirements:

  • OpenAI API key

  • System prompt + tools totaling > 1024 tokens

Test procedure:

  1. Create a prompt with at least 1024 tokens. Use a detailed system message or add several tool definitions.

  2. Call gpt-4o-mini three times with identical prefix. Log cached_tokens from each response.

  3. Wait 5 seconds.

  4. Call gpt-5-mini with the same prefix. Check cached_tokens on the first call.

  5. If cached_tokens > 0 on gpt-5-mini's first call, you've confirmed cross-model cache sharing.

Minimal test script:

import openai
import time

client = openai.OpenAI(api_key="your-key")

messages = [
    {"role": "system", "content": "Your 1024+ token system prompt here..."},
    {"role": "user", "content": "Test query"}
]

tools = [...]  # Your tool definitions

# Call 1: gpt-4o-mini
response1 = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages,
    tools=tools
)
print(f"4o-mini: {response1.usage.prompt_tokens_details.cached_tokens} cached")

time.sleep(5)

# Call 2: gpt-5-mini
response2 = client.chat.completions.create(
    model="gpt-5-mini",
    messages=messages,
    tools=tools
)
print(f"5-mini: {response2.usage.prompt_tokens_details.cached_tokens} cached")

Expected output:

4o-mini: 0 cached
5-mini: 1408 cached

If gpt-5-mini shows cached tokens on its first call, you've reproduced the finding.


Key Takeaways

Cross-model cache sharing exists. It's not documented, but it's measurable and reproducible. gpt-4o-mini, gpt-5-mini, and gpt-5 share a prefix-processing cache at the organization level.

The cost impact scales with cold starts. For sustained traffic with natural cache warmth, cross-model warming adds little. For high session turnover (1,000+ cold starts/day), explicit nano-warming can save $4K-$400K/year depending on target model and prefix size.

Tool definitions are heavily compressed. Don't avoid adding tools for token concerns. OpenAI's schema compression means the overhead is far lower than JSON character count suggests.

Measurement beats assumption. Token economics requires logging every call, tracking cached_tokens, and calculating actual costs. The only way to know if an optimization works is to measure it in your specific workload.

This is Phase 3 of ongoing research. Next up: structured outputs (eliminating retry loops), reasoning effort control (gpt-5 token/quality tradeoff), and batch API (50% cost reduction with 24-hour latency). Each technique gets tested with real numbers, not theory.

If you're building production LLM systems, log your token usage. The optimizations aren't obvious until you see where the tokens actually go.


Building this agent from scratch—no frameworks, full visibility—specifically to understand token costs at every layer. All experiments, code, and data published as I go.