Gemini Introduced Context Caching for AI - It's 4x Cheaper but No One Will Use It

We're not using context caching at Reprompt, but it sure is cool.

Rob Balian, CTO @ Reprompt

June 28, 2024

Google just announced Context Caching for the Gemini 1.5 Flash and 1.5 Pro models. Let's break down the numbers and see who it's really for:

📊 The Basics

For Gemini 1.5 Flash:

  • Non-cached input: $0.35 / 1M tokens (already super cheap compared to $5 / 1M tokens for GPT-4o)
  • Cached input: $0.0875 / 1M tokens - that's 4x cheaper than non-cached!
  • Cache storage cost: $1.00 / 1M tokens per hour
  • Max context window: still 1M tokens, whether the input is cached or not
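
Plugging those list prices into a tiny cost model makes the scenarios later in this post easy to sanity-check. This is just a sketch in Python: the per-token prices come from the Gemini 1.5 Flash list above, the traffic and context sizes are made-up inputs, and output-token costs are ignored.

```python
# Gemini 1.5 Flash input pricing (June 2024, prompts <= 128k tokens)
INPUT_PER_M = 0.35           # $ per 1M non-cached input tokens
CACHED_INPUT_PER_M = 0.0875  # $ per 1M cached input tokens
STORAGE_PER_M_HOUR = 1.00    # $ per 1M tokens held in the cache, per hour

def hourly_cost(cached_per_msg, noncached_per_msg, messages_per_hour, cache_size):
    """Rough hourly input cost: per-message token charges plus cache storage."""
    token_cost = messages_per_hour * (
        cached_per_msg * CACHED_INPUT_PER_M + noncached_per_msg * INPUT_PER_M
    ) / 1_000_000
    storage_cost = cache_size * STORAGE_PER_M_HOUR / 1_000_000
    return token_cost + storage_cost

# e.g. a 32k-token context at 1,200 messages/hour
print(hourly_cost(0, 32_000, 1_200, 0))       # ~$13.44 without caching
print(hourly_cost(32_000, 0, 1_200, 32_000))  # ~$3.39 with caching
```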

🎣 So What's the Cache?

Sounds great, right? 4x savings for my prompts! But here's the catch:

  • 🚫 Minimum cacheable context: 32k tokens. What?! That's like 50 pages of text. The average chatbot context is < ~4k tokens
  • Most AI applications, including chatbots, use nowhere near 32k tokens per interaction. And the chat responses themselves can't be cached because they change constantly.
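
For a sense of what the workflow actually looks like, here's a rough sketch using the google-generativeai Python SDK as it shipped around launch. Treat the module and parameter names as approximate (they may have changed since), and the API key, file path, and prompts are placeholders.

```python
# Sketch of the context caching workflow with the google-generativeai SDK.
import datetime
import google.generativeai as genai
from google.generativeai import caching

genai.configure(api_key="YOUR_API_KEY")

# The cached content has to total at least ~32k tokens, or creation fails.
big_manual = open("product_manual.txt").read()

cache = caching.CachedContent.create(
    model="models/gemini-1.5-flash-001",
    display_name="support-manual",
    system_instruction="Answer questions using the attached manual.",
    contents=[big_manual],
    ttl=datetime.timedelta(hours=1),  # storage is billed per token-hour
)

# Each request re-reads the cached tokens at the discounted rate;
# only the new question is billed at the full input price.
model = genai.GenerativeModel.from_cached_content(cached_content=cache)
response = model.generate_content("How do I reset the device?")
print(response.text)
```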

So Who's It For?

  • ✅ Specialized apps with massive, static contexts
  • ❌ Most chatbots and general AI applications

What About RAG Applications?

Retrieval-Augmented Generation (RAG) apps might seem like a perfect fit, but there's a twist:

  • Typical RAG retrieves ~4,000 tokens per query
  • This falls far short of the 32k minimum for caching
  • Caching the entire knowledge base could be prohibitively expensive
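
Quick back-of-the-envelope math on that last point, using the same 300k-token knowledge-base assumption as Scenario 3 below:

```python
# Per-message input cost: typical RAG retrieval vs. caching the whole KB.
INPUT_PER_M = 0.35           # $ / 1M non-cached input tokens
CACHED_INPUT_PER_M = 0.0875  # $ / 1M cached input tokens

rag_retrieved = 4_000        # tokens retrieved per query
kb_size = 300_000            # assumed size of the full knowledge base

rag_cost = rag_retrieved * INPUT_PER_M / 1_000_000        # ~$0.0014 per message
cached_cost = kb_size * CACHED_INPUT_PER_M / 1_000_000    # ~$0.0263 per message

print(f"RAG retrieval: ${rag_cost:.4f}   cached KB read: ${cached_cost:.4f}")
# Even at the 4x discount, re-reading 300k cached tokens on every message
# costs ~19x more than retrieving 4k tokens at the full input price.
```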

The Takeaway

  • Caching is overkill for most: If you're not regularly using 32k+ token prompts, you won't see savings.
  • Storage costs add up: $1/hour per 1M tokens cached. That's $720/month for just one full context.
  • Management overhead: Implementing and maintaining caches isn't trivial.

While Context Caching could be revolutionary for niche, large-context applications, it's likely irrelevant for the vast majority of AI use cases.

I hope that Google and Sam start including caching automatically behind the scenes to drive token prices down, instead of offering it as a service that doesn't seem to work for anyone.

You Want the Numbers? Here You Go:

Scenario 1: Customer Support Chatbot with a Massive 32k Context Window

Without Caching:

  • Tokens per Message: 32,000 (context)
  • Cached Tokens per Message: 0
  • Non-Cached Tokens per Message: 32,000
  • Messages per Hour: 1,200
  • Token Cost per Hour: $13.44
  • Storage Cost per Hour: $0.00
  • Total Cost per Hour: $13.44

With Caching:

  • Tokens per Message: 32,000 (cached)
  • Cached Tokens per Message: 32,000
  • Non-Cached Tokens per Message: 0
  • Messages per Hour: 1,200
  • Token Cost per Hour: $3.36
  • Storage Cost per Hour: $0.032
  • Total Cost per Hour: $3.392 (75% cost DECREASE)

Scenario 2: Customer Support Chatbot with Reasonable Context Window of 4k

Without Caching:

  • Tokens per Message: 4,000 (context) + 1,000 (overhead) = 5,000
  • Cached Tokens per Message: 0
  • Non-Cached Tokens per Message: 5,000
  • Messages per Hour: 1,200
  • Token Cost per Hour: $2.10
  • Storage Cost per Hour: $0.00
  • Total Cost per Hour: $2.10

With Caching (adhering to the minimum 32k cached context window):

  • Tokens per Message: 32,000 (cached) + 1,000 (overhead) = 33,000
  • Cached Tokens per Message: 32,000
  • Non-Cached Tokens per Message: 1,000
  • Messages per Hour: 1,200
  • Token Cost per Hour: $3.36 (cached input) + $0.42 (overhead input) = $3.78
  • Storage Cost per Hour: $0.032
  • Total Cost per Hour: $3.812 (+82% cost INCREASE)

Scenario 3: Retrieval-Augmented Generation (RAG) Model

Without Caching:

  • Tokens per Message: 4,000 (retrieved) + 1,000 (overhead) = 5,000
  • Cached Tokens per Message: 0
  • Non-Cached Tokens per Message: 5,000
  • Messages per Hour: 1,200
  • Token Cost per Hour: $2.10
  • Storage Cost per Hour: $0.00
  • Total Cost per Hour: $2.10

With Caching:

  • Tokens per Message: 300,000 (cached) + 1,000 (overhead) = 301,000
  • Cached Tokens per Message: 300,000
  • Non-Cached Tokens per Message: 1,000
  • Messages per Hour: 1,200
  • Token Cost per Hour: $31.50 (cached input) + $0.42 (overhead input) = $31.92
  • Storage Cost per Hour: $0.30
  • Total Cost per Hour: $32.22 (1400% cost INCREASE)
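
If you want to check these figures yourself, here's a short script that reproduces all three scenario totals from the prices and traffic assumptions above:

```python
# Reproduce the three scenarios above (Gemini 1.5 Flash prices, June 2024).
INPUT_PER_M = 0.35           # $ / 1M non-cached input tokens
CACHED_INPUT_PER_M = 0.0875  # $ / 1M cached input tokens
STORAGE_PER_M_HOUR = 1.00    # $ / 1M cached tokens, per hour
MESSAGES_PER_HOUR = 1_200

def hourly(cached_per_msg, noncached_per_msg, cache_size):
    """Hourly cost = per-message token charges + cache storage."""
    tokens = MESSAGES_PER_HOUR * (
        cached_per_msg * CACHED_INPUT_PER_M + noncached_per_msg * INPUT_PER_M
    ) / 1_000_000
    storage = cache_size * STORAGE_PER_M_HOUR / 1_000_000
    return tokens + storage

scenarios = {
    "1: 32k context, no cache":   hourly(0, 32_000, 0),            # $13.44
    "1: 32k context, cached":     hourly(32_000, 0, 32_000),        # $3.392
    "2: 4k context, no cache":    hourly(0, 5_000, 0),              # $2.10
    "2: 4k padded to 32k cache":  hourly(32_000, 1_000, 32_000),    # $3.812
    "3: RAG, no cache":           hourly(0, 5_000, 0),              # $2.10
    "3: RAG, 300k KB cached":     hourly(300_000, 1_000, 300_000),  # $32.22
}
for name, cost in scenarios.items():
    print(f"{name:28s} ${cost:.3f}/hour")
```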