Gemini Introduced Context Caching for AI - It's 4x Cheaper but No One Will Use It

We're not using context caching at Reprompt, but it sure is cool.

Rob Balian, CTO @ Reprompt

June 28, 2024

Google just announced Context Caching for the Gemini 1.5 Flash and 1.5 Pro models. Let's break down the numbers and see who it's really for:

📊 The Basics

For Gemini 1.5 Flash:

  • Non-cached input: $0.35 / 1M tokens (already super cheap compared to $5 / 1M tokens for GPT-4o)
  • Cached input: $0.0875 / 1M tokens - that's 4x cheaper than non-cached!
  • Cache storage cost: $1.00 / 1M tokens per hour
  • Max context window: still 1M tokens, whether the input is cached or not
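
Plugging those list prices into a tiny cost model makes the scenarios later in this post easy to sanity-check. This is just a sketch in Python: the per-token prices come from the Gemini 1.5 Flash list above, the traffic and context sizes are made-up inputs, and output-token costs are ignored.

```python
# Gemini 1.5 Flash input pricing (June 2024, prompts <= 128k tokens)
INPUT_PER_M = 0.35           # $ per 1M non-cached input tokens
CACHED_INPUT_PER_M = 0.0875  # $ per 1M cached input tokens
STORAGE_PER_M_HOUR = 1.00    # $ per 1M tokens held in the cache, per hour

def hourly_cost(cached_per_msg, noncached_per_msg, messages_per_hour, cache_size):
    """Rough hourly input cost: per-message token charges plus cache storage."""
    token_cost = messages_per_hour * (
        cached_per_msg * CACHED_INPUT_PER_M + noncached_per_msg * INPUT_PER_M
    ) / 1_000_000
    storage_cost = cache_size * STORAGE_PER_M_HOUR / 1_000_000
    return token_cost + storage_cost

# e.g. a 32k-token context at 1,200 messages/hour
print(hourly_cost(0, 32_000, 1_200, 0))       # ~$13.44 without caching
print(hourly_cost(32_000, 0, 1_200, 32_000))  # ~$3.39 with caching
```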

🎣 So What's the Cache?

Sounds great, right? 4x savings for my prompts! But here's the catch:

  • 🚫 Minimum cacheable context: 32k tokens. What?! That's like 50 pages of text. The average chatbot context is < ~4k tokens
  • Most AI applications, including chatbots, use nowhere near 32k tokens per interaction. And the chat responses themselves can't be cached because they change constantly.
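
For a sense of what the workflow actually looks like, here's a rough sketch using the google-generativeai Python SDK as it shipped around launch. Treat the module and parameter names as approximate (they may have changed since), and the API key, file path, and prompts are placeholders.

```python
# Sketch of the context caching workflow with the google-generativeai SDK.
import datetime
import google.generativeai as genai
from google.generativeai import caching

genai.configure(api_key="YOUR_API_KEY")

# The cached content has to total at least ~32k tokens, or creation fails.
big_manual = open("product_manual.txt").read()

cache = caching.CachedContent.create(
    model="models/gemini-1.5-flash-001",
    display_name="support-manual",
    system_instruction="Answer questions using the attached manual.",
    contents=[big_manual],
    ttl=datetime.timedelta(hours=1),  # storage is billed per token-hour
)

# Each request re-reads the cached tokens at the discounted rate;
# only the new question is billed at the full input price.
model = genai.GenerativeModel.from_cached_content(cached_content=cache)
response = model.generate_content("How do I reset the device?")
print(response.text)
```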

So Who's It For?

  • ✅ Specialized apps with massive, static contexts
  • ❌ Most chatbots and general AI applications

What About RAG Applications?

Retrieval-Augmented Generation (RAG) apps might seem like a perfect fit, but there's a twist:

  • Typical RAG retrieves ~4,000 tokens per query
  • This falls far short of the 32k minimum for caching
  • Caching the entire knowledge base could be prohibitively expensive
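
Quick back-of-the-envelope math on that last point, using the same 300k-token knowledge-base assumption as Scenario 3 below:

```python
# Per-message input cost: typical RAG retrieval vs. caching the whole KB.
INPUT_PER_M = 0.35           # $ / 1M non-cached input tokens
CACHED_INPUT_PER_M = 0.0875  # $ / 1M cached input tokens

rag_retrieved = 4_000        # tokens retrieved per query
kb_size = 300_000            # assumed size of the full knowledge base

rag_cost = rag_retrieved * INPUT_PER_M / 1_000_000        # ~$0.0014 per message
cached_cost = kb_size * CACHED_INPUT_PER_M / 1_000_000    # ~$0.0263 per message

print(f"RAG retrieval: ${rag_cost:.4f}   cached KB read: ${cached_cost:.4f}")
# Even at the 4x discount, re-reading 300k cached tokens on every message
# costs ~19x more than retrieving 4k tokens at the full input price.
```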

The Takeaway

  • Caching is overkill for most: If you're not regularly using 32k+ token prompts, you won't see savings.
  • Storage costs add up: $1/hour per 1M tokens cached. That's $720/month for just one full context.
  • Management overhead: Implementing and maintaining caches isn't trivial.

While Context Caching could be revolutionary for niche, large-context applications, it's likely irrelevant for the vast majority of AI use cases.

I hope that Google and Sam start including caching automatically behind the scenes to drive token prices down, instead of offering it as a service that doesn't seem to work for anyone.

You Want the Numbers? Here You Go:

Scenario 1: Customer Support Chatbot with a Massive 32k Context Window

Without Caching:

  • Tokens per Message: 32,000 (context)
  • Cached Tokens per Message: 0
  • Non-Cached Tokens per Message: 32,000
  • Messages per Hour: 1,200
  • Token Cost per Hour: $13.44
  • Storage Cost per Hour: $0.00
  • Total Cost per Hour: $13.44

With Caching:

  • Tokens per Message: 32,000 (cached)
  • Cached Tokens per Message: 32,000
  • Non-Cached Tokens per Message: 0
  • Messages per Hour: 1,200
  • Token Cost per Hour: $3.36
  • Storage Cost per Hour: $0.032
  • Total Cost per Hour: $3.392 (75% cost DECREASE)

Scenario 2: Customer Support Chatbot with Reasonable Context Window of 4k

Without Caching:

  • Tokens per Message: 4,000 (context) + 1,000 (overhead) = 5,000
  • Cached Tokens per Message: 0
  • Non-Cached Tokens per Message: 5,000
  • Messages per Hour: 1,200
  • Token Cost per Hour: $2.10
  • Storage Cost per Hour: $0.00
  • Total Cost per Hour: $2.10

With Caching (adhering to the minimum 32k cached context window):

  • Tokens per Message: 32,000 (cached) + 1,000 (overhead) = 33,000
  • Cached Tokens per Message: 32,000
  • Non-Cached Tokens per Message: 1,000
  • Messages per Hour: 1,200
  • Token Cost per Hour: $3.36 (cached input) + $0.42 (overhead input) = $3.78
  • Storage Cost per Hour: $0.032
  • Total Cost per Hour: $3.812 (+82% cost INCREASE)

Scenario 3: Retrieval-Augmented Generation (RAG) Model

Without Caching:

  • Tokens per Message: 4,000 (retrieved) + 1,000 (overhead) = 5,000
  • Cached Tokens per Message: 0
  • Non-Cached Tokens per Message: 5,000
  • Messages per Hour: 1,200
  • Token Cost per Hour: $2.10
  • Storage Cost per Hour: $0.00
  • Total Cost per Hour: $2.10

With Caching:

  • Tokens per Message: 300,000 (cached) + 1,000 (overhead) = 301,000
  • Cached Tokens per Message: 300,000
  • Non-Cached Tokens per Message: 1,000
  • Messages per Hour: 1,200
  • Token Cost per Hour: $31.50 (cached input) + $0.42 (overhead input) = $31.92
  • Storage Cost per Hour: $0.30
  • Total Cost per Hour: $32.22 (1400% cost INCREASE)
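
If you want to check these figures yourself, here's a short script that reproduces all three scenario totals from the prices and traffic assumptions above:

```python
# Reproduce the three scenarios above (Gemini 1.5 Flash prices, June 2024).
INPUT_PER_M = 0.35           # $ / 1M non-cached input tokens
CACHED_INPUT_PER_M = 0.0875  # $ / 1M cached input tokens
STORAGE_PER_M_HOUR = 1.00    # $ / 1M cached tokens, per hour
MESSAGES_PER_HOUR = 1_200

def hourly(cached_per_msg, noncached_per_msg, cache_size):
    """Hourly cost = per-message token charges + cache storage."""
    tokens = MESSAGES_PER_HOUR * (
        cached_per_msg * CACHED_INPUT_PER_M + noncached_per_msg * INPUT_PER_M
    ) / 1_000_000
    storage = cache_size * STORAGE_PER_M_HOUR / 1_000_000
    return tokens + storage

scenarios = {
    "1: 32k context, no cache":   hourly(0, 32_000, 0),            # $13.44
    "1: 32k context, cached":     hourly(32_000, 0, 32_000),        # $3.392
    "2: 4k context, no cache":    hourly(0, 5_000, 0),              # $2.10
    "2: 4k padded to 32k cache":  hourly(32_000, 1_000, 32_000),    # $3.812
    "3: RAG, no cache":           hourly(0, 5_000, 0),              # $2.10
    "3: RAG, 300k KB cached":     hourly(300_000, 1_000, 300_000),  # $32.22
}
for name, cost in scenarios.items():
    print(f"{name:28s} ${cost:.3f}/hour")
```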