Gemini Introduced Context Caching for AI - It's 4x Cheaper but No One Will Use It
Google just announced Context Caching for Gemini Flash and Pro 1.5 models. Let's break down the numbers and see who it's really for:
Rob Balian
CTO
📊 The Basics
For Gemini Flash 1.5:
Non-cached input: $0.35 / 1M tokens (already super cheap compared to $5 / 1M tokens for GPT-4o)
Cached input: $0.0875 / 1M tokens - that's 4x cheaper than non-cached!
Cache storage cost: $1.00 / 1M tokens per hour
Max context window: still 1M tokens, whether the input is cached or not
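To make those rates concrete, here is a minimal Python sketch that prices a single prompt at the Flash 1.5 figures listed above. The constant and function names are illustrative assumptions, not part of any Google SDK; only the dollar amounts come from the list.

```python
# Gemini Flash 1.5 prices quoted above, in USD per 1M tokens.
NON_CACHED_PER_M = 0.35
CACHED_PER_M = 0.0875
STORAGE_PER_M_PER_HOUR = 1.00


def input_cost(tokens: int, cached: bool = False) -> float:
    """Cost in USD of sending `tokens` input tokens once."""
    rate = CACHED_PER_M if cached else NON_CACHED_PER_M
    return tokens * rate / 1_000_000


def storage_cost(tokens: int, hours: float = 1.0) -> float:
    """Cost in USD of keeping `tokens` in the cache for `hours`."""
    return tokens * STORAGE_PER_M_PER_HOUR * hours / 1_000_000


if __name__ == "__main__":
    prompt = 32_000  # example prompt size, chosen for illustration
    print(f"non-cached: ${input_cost(prompt):.4f}")
    print(f"cached:     ${input_cost(prompt, cached=True):.4f}")
    print(f"storage/hr: ${storage_cost(prompt):.4f}")
    print(f"discount:   {NON_CACHED_PER_M / CACHED_PER_M:.0f}x")
```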
🎣 So What's the Cache?
Sounds great, right? 4x savings for my prompts! But here's the catch:
🚫 Minimum cacheable context: 32k tokens. What?! That's like 50 pages of text. The average chatbot context is under roughly 4k tokens
Most AI applications, including chatbots, use nowhere near 32k tokens per interaction. And the chat responses themselves can't be cached because they change constantly.
So Who's It For?
✅ Specialized apps with massive, static contexts
❌ Most chatbots and general AI applications
What About RAG Applications?
Retrieval-Augmented Generation (RAG) apps might seem like a perfect fit, but there's a twist:
Typical RAG retrieves ~4,000 tokens per query
This falls far short of the 32k minimum for caching
Caching the entire knowledge base could be prohibitively expensive
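The arithmetic behind that twist can be sketched in a few lines of Python, using the Flash 1.5 rates from The Basics. The function names are hypothetical, and storage cost is ignored here, which only flatters the cached option.

```python
NON_CACHED_PER_M = 0.35   # USD per 1M input tokens
CACHED_PER_M = 0.0875     # USD per 1M cached input tokens


def per_query_cost(tokens: int, cached: bool) -> float:
    """Input cost in USD of one query carrying `tokens` of context."""
    rate = CACHED_PER_M if cached else NON_CACHED_PER_M
    return tokens * rate / 1_000_000


def break_even_cached_tokens(uncached_tokens: int) -> float:
    """Cached context size that costs the same per query as
    `uncached_tokens` sent at the non-cached rate."""
    return uncached_tokens * NON_CACHED_PER_M / CACHED_PER_M


if __name__ == "__main__":
    rag = per_query_cost(4_000, cached=False)
    minimum_cache = per_query_cost(32_000, cached=True)
    print(f"4k retrieved, non-cached: ${rag:.4f} per query")
    print(f"32k cached minimum:       ${minimum_cache:.4f} per query")
    print(f"break-even cached size:   {break_even_cached_tokens(4_000):,.0f} tokens")
```

At these rates a cached context costs more per query than a 4k non-cached retrieval once it passes about 16k tokens, which is already below the 32k minimum.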
The Takeaway
Caching is overkill for most: If you're not regularly using 32k+ token prompts, you won't see savings.
Storage costs add up: $1/hour per 1M tokens cached. That's $720/month to keep just one full 1M-token context cached.
Management overhead: Implementing and maintaining caches isn't trivial.
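The storage figure is easy to check, and the same rates give a rough break-even point for prompts that do clear the 32k minimum. A minimal sketch, assuming a 30-day month and the Flash 1.5 prices from The Basics; the names are illustrative:

```python
NON_CACHED_PER_M = 0.35          # USD per 1M input tokens
CACHED_PER_M = 0.0875            # USD per 1M cached input tokens
STORAGE_PER_M_PER_HOUR = 1.00    # USD per 1M cached tokens per hour

HOURS_PER_MONTH = 24 * 30        # assumption: 30-day month


def monthly_storage_cost(cached_tokens: int) -> float:
    """USD to keep `cached_tokens` in the cache for a whole month."""
    return cached_tokens / 1_000_000 * STORAGE_PER_M_PER_HOUR * HOURS_PER_MONTH


def break_even_requests_per_hour() -> float:
    """Requests per hour at which the per-token discount pays for storage.
    The cache size cancels out, so the result is the same for any size."""
    saving_per_m = NON_CACHED_PER_M - CACHED_PER_M
    return STORAGE_PER_M_PER_HOUR / saving_per_m


if __name__ == "__main__":
    print(f"1M tokens cached for a month: ${monthly_storage_cost(1_000_000):.2f}")
    print(f"break-even: {break_even_requests_per_hour():.2f} requests/hour")
```

Under these assumptions, storage only pays for itself when the same cached context is reused about four or more times an hour.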
While Context Caching could be revolutionary for niche, large-context applications, it's likely irrelevant for the vast majority of AI use cases.
I hope that Google and OpenAI start applying caching automatically behind the scenes to drive token prices down, instead of offering it as a separate service that so few applications can use.
You Want the Numbers? Here You Go:
Scenario 1: Customer Support Chatbot with a Massive 32k Context Window
Without Caching:
Tokens per Message: 32,000 (context)
Cached Tokens per Message: 0
Non-Cached Tokens per Message: 32,000
Messages per Hour: 1,200
Token Cost per Hour: $13.44
Storage Cost per Hour: $0.00
Total Cost per Hour: $13.44
With Caching:
Tokens per Message: 32,000 (cached)
Cached Tokens per Message: 32,000
Non-Cached Tokens per Message: 0
Messages per Hour: 1,200
Token Cost per Hour: $3.36
Storage Cost per Hour: $0.032
Total Cost per Hour: $3.392 (75% cost DECREASE)
Scenario 2: Customer Support Chatbot with Reasonable Context Window of 4k
Without Caching:
Tokens per Message: 4,000 (context) + 1,000 (overhead) = 5,000
Cached Tokens per Message: 0
Non-Cached Tokens per Message: 5,000
Messages per Hour: 1,200
Token Cost per Hour: $2.10
Storage Cost per Hour: $0.00
Total Cost per Hour: $2.10
With Caching (adhering to the minimum 32k cached context window):
Tokens per Message: 32,000 (cached) + 1,000 (overhead) = 33,000
Cached Tokens per Message: 32,000
Non-Cached Tokens per Message: 1,000
Messages per Hour: 1,200
Token Cost per Hour: $3.36 (cached input) + $0.42 (overhead input) = $3.78
Storage Cost per Hour: $0.032
Total Cost per Hour: $3.812 (82% cost INCREASE)
Scenario 3: Retrieval-Augmented Generation (RAG) Application
Without Caching:
Tokens per Message: 4,000 (retrieved) + 1,000 (overhead) = 5,000
Cached Tokens per Message: 0
Non-Cached Tokens per Message: 5,000
Messages per Hour: 1,200
Token Cost per Hour: $2.10
Storage Cost per Hour: $0.00
Total Cost per Hour: $2.10
With Caching:
Tokens per Message: 300,000 (cached) + 1,000 (overhead) = 301,000
Cached Tokens per Message: 300,000
Non-Cached Tokens per Message: 1,000
Messages per Hour: 1,200
Token Cost per Hour: $31.50 (cached input) + $0.42 (overhead input) = $31.92
Storage Cost per Hour: $0.30
Total Cost per Hour: $32.22 (roughly 1,430% cost INCREASE)
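The three scenarios can be reproduced with a short Python sketch. The `hourly_cost` helper and its parameters are hypothetical names used for illustration; the rates are the Flash 1.5 prices from The Basics, and overhead tokens are priced at the non-cached rate, as in the Scenario 3 breakdown.

```python
NON_CACHED_PER_M = 0.35          # USD per 1M input tokens
CACHED_PER_M = 0.0875            # USD per 1M cached input tokens
STORAGE_PER_M_PER_HOUR = 1.00    # USD per 1M cached tokens per hour


def hourly_cost(cached_tokens: int, non_cached_tokens: int,
                messages_per_hour: int = 1_200) -> float:
    """Total USD per hour: token cost for every message plus cache storage."""
    token_cost = messages_per_hour * (
        cached_tokens * CACHED_PER_M + non_cached_tokens * NON_CACHED_PER_M
    ) / 1_000_000
    storage = cached_tokens * STORAGE_PER_M_PER_HOUR / 1_000_000
    return token_cost + storage


def percent_change(before: float, after: float) -> float:
    """Signed change from `before` to `after`, in percent."""
    return (after - before) / before * 100


# (name, (cached, non-cached) without caching, (cached, non-cached) with caching)
SCENARIOS = [
    ("1: 32k chatbot", (0, 32_000), (32_000, 0)),
    ("2: 4k chatbot", (0, 5_000), (32_000, 1_000)),
    ("3: RAG", (0, 5_000), (300_000, 1_000)),
]

if __name__ == "__main__":
    for name, without, with_cache in SCENARIOS:
        before = hourly_cost(*without)
        after = hourly_cost(*with_cache)
        print(f"Scenario {name}: ${before:.3f} -> ${after:.3f} "
              f"({percent_change(before, after):+.0f}%)")
```

Changing `messages_per_hour` or the token counts in `SCENARIOS` is a quick way to test whether a specific workload would come out ahead.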