Semantic Cache: Faster and Cheaper AI

Drive down costs and latency with our context-aware LLM cache for conversational AI.

Get your API key

cache.py
import httpx
import openai

# Point the OpenAI client at the Canonical cache and authenticate with your API key.
client = openai.Client(
    base_url='https://cacheapp.canonical.chat/',
    http_client=httpx.Client(
        headers={
            "X-Canonical-Api-Key": "<your-api-key>"
        }
    )
)
completion = client.chat.completions.create(...)

How Does Semantic Caching Work?

For each user query, first search the cache semantically for a previously answered query that means essentially the same thing, even if it is phrased differently. On a cache hit, return the cached response instead of calling the LLM. A cache hit responds faster and costs less than an LLM call.
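
For intuition, here is a minimal, self-contained sketch of that lookup-then-fallback flow. The toy bag-of-words embedding, the in-memory cache list, and the 0.9 similarity threshold below are illustrative assumptions only, not the Canonical implementation.

cache_sketch.py
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding" (a real cache uses a learned embedding model).
    return Counter(text.lower().split())

def similarity(a: Counter, b: Counter) -> float:
    # Cosine similarity between two bag-of-words vectors.
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

cache: list[tuple[Counter, str]] = []  # (query embedding, cached response)

def answer(query: str, call_llm) -> str:
    vec = embed(query)
    # Semantic lookup: find the closest previously cached query.
    best = max(cache, key=lambda entry: similarity(vec, entry[0]), default=None)
    if best is not None and similarity(vec, best[0]) >= 0.9:
        return best[1]                   # cache hit: skip the LLM entirely
    response = call_llm(query)           # cache miss: fall back to the LLM
    cache.append((vec, response))        # populate the cache for next time
    return response

A real deployment replaces the toy embedding with a learned model and makes the match context-aware, as described under Features below.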

Semantic Cache Demo

In the first conversation, the user asks new questions, the LLM responds, and the cache gets populated.

In the second conversation (after the terminal is cleared), the user asks the same questions, but with different phrasing. The responses are returned from the semantic cache and the time to first token is 10x faster.

Features

  • Context-Aware Semantic Caching. High-precision caching for conversational AI (e.g., Voice AI agents). Get cache hits only when it's contextually appropriate.
  • High Recall Semantic Caching. Get cache hits even for open-ended applications where you wouldn’t expect many cache hits.
  • Fast Semantic Caching. Response times are ~40 ms for on-prem deployments and ~120 ms over the network.
  • Secure Semantic Caching. Queries containing PII are not cached, so user data stays safe.
  • Multitenancy. Each product, each AI persona, or each user can have its own cache. You decide the scope of the cache.
  • Tunable Cache Temperature. You decide whether you want a cache hit to return the same response or differently phrased responses.
  • Simple Integration. Deploy our LLM Cache one step upstream of your LLM call. If there’s a cache hit, don’t call your LLM. If there’s a cache miss, update the semantic cache with the LLM completion after you’ve responded to the user (see the sketch after this list).
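
To make the integration pattern concrete, here is a hedged sketch of the check-then-update flow. The /query and /update endpoint paths, the request/response fields, and the model name used below are hypothetical placeholders, not the documented Canonical API; only the base URL and header come from the snippet above.

integration_sketch.py
import httpx
import openai

CACHE_URL = "https://cacheapp.canonical.chat"        # base URL from the snippet above
HEADERS = {"X-Canonical-Api-Key": "<your-api-key>"}

llm = openai.OpenAI()

def respond(messages: list[dict]) -> str:
    with httpx.Client(base_url=CACHE_URL, headers=HEADERS) as cache:
        # 1. Check the cache first (hypothetical endpoint and schema).
        result = cache.post("/query", json={"messages": messages}).json()
        if result.get("hit"):
            return result["completion"]              # cache hit: skip the LLM
        # 2. Cache miss: call the LLM as usual.
        completion = llm.chat.completions.create(
            model="gpt-4o-mini",                     # placeholder model name
            messages=messages,
        )
        answer = completion.choices[0].message.content
        # 3. After responding to the user, write the completion back to the
        #    cache so the next similar query is a hit (hypothetical endpoint).
        cache.post("/update", json={"messages": messages, "completion": answer})
        return answer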

Save 50% On Your LLM Token Costs

We charge only for cache hits: whatever model you're using, a cache hit costs 50% of that model's per-token price.

Don't Delay On Dropping Latency

Generate an API key using the link above to try out the Canonical cache for yourself!

If you'd like to discuss how conversational AI caching can help reduce your LLM app's latency and cost, please email us! We'd love to hear from you!
