Semantic Cache: Faster and Cheaper AI

Drive down costs and latency with our ultra-low-latency LLM cache and turnkey fine-tuned models.

cache.py
import httpx
import openai

# Route OpenAI-compatible requests through the Canonical cache endpoint,
# authenticating with your Canonical API key.
client = openai.Client(
    base_url='https://cacheapp.canonical.chat/',
    http_client=httpx.Client(
        headers={
            "X-Canonical-Api-Key": "<your-api-key>"
        }
    )
)
completion = client.chat.completions.create(...)

How Does Semantic Caching Work?

For each user query, first search the semantic cache for a previously answered query that means essentially the same thing, even if the phrasing differs. On a cache hit, return the cached response instead of calling the LLM: the cached response arrives faster and costs less than an LLM call.
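
A minimal sketch of the idea, assuming a toy bag-of-words embedding and cosine similarity as stand-ins for a real embedding model. The threshold, the embed function, and the SemanticCache class below are illustrative assumptions, not Canonical's implementation.

import math
from collections import Counter

def embed(text):
    # Toy stand-in for a real embedding model: a bag-of-words count vector.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

class SemanticCache:
    def __init__(self, threshold=0.75):
        self.threshold = threshold  # minimum similarity to count as "the same query"
        self.entries = []           # list of (query_embedding, cached_response) pairs

    def lookup(self, query):
        q = embed(query)
        scored = [(cosine(q, emb), resp) for emb, resp in self.entries]
        if scored:
            score, resp = max(scored, key=lambda s: s[0])
            if score >= self.threshold:
                return resp         # cache hit: skip the LLM call
        return None                 # cache miss: the caller invokes the LLM

    def store(self, query, response):
        self.entries.append((embed(query), response))

cache = SemanticCache()
cache.store("what are your support hours", "We're available 9am to 5pm ET.")
print(cache.lookup("what are the support hours"))  # similar phrasing -> cache hit
print(cache.lookup("how do I reset my password"))  # unrelated query -> None (miss)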

Semantic Cache Demo

In the first conversation, the user asks new questions, the LLM responds, and the cache gets populated.

In the second conversation (after the terminal is cleared), the user asks the same questions, but with different phrasing. The responses are returned from the semantic cache and the time to first token is 10x faster.

Features

  • Fast Semantic Caching. Response times are ~40 ms for on-prem deployments and ~120 ms over the network.
  • High Precision Semantic Caching. Quality is paramount for your user experience: on a cache hit, the Canonical Cache returns a response that correctly answers the user's query.
  • High Recall Semantic Caching. Get cache hits even in applications where you wouldn't expect many.
  • Multitenancy. Each product, each AI persona, or each user can have its own cache. You decide the scope of the cache.
  • Tunable Cache Temperature. You decide whether you want a cache hit to return the same response or differently phrased responses.
  • Simple Integration. Deploy our LLM Cache one step upstream of your LLM call. If there's a cache hit, don't call your LLM. If there's a cache miss, update the semantic cache with the LLM completion after you've responded to the user (see the sketch below).
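
As a sketch of that hit/miss flow: the cache_lookup and cache_update helpers and the model name below are hypothetical placeholders, not Canonical's API; only the control flow matters.

import openai

llm = openai.Client()  # your existing LLM client

def answer(user_query, cache_lookup, cache_update):
    cached = cache_lookup(user_query)
    if cached is not None:
        return cached  # cache hit: respond from the cache, no LLM call

    # Cache miss: call the LLM as usual.
    completion = llm.chat.completions.create(
        model="gpt-4o-mini",  # whichever model you already use
        messages=[{"role": "user", "content": user_query}],
    )
    response = completion.choices[0].message.content

    # Write the completion back to the cache so the next similarly phrased
    # query is a hit (in production, do this after responding to the user,
    # e.g. once streaming has finished).
    cache_update(user_query, response)
    return response

The cache sits one step in front of the model, so a hit never touches the LLM and a miss only adds a single lookup before the call you were already making.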

Save 50% On Your LLM Token Costs

We charge only for cache hits: whatever model you're using, a hit is billed at 50% of that model's per-token price.

Don't Delay On Dropping Latency

If you're interested in trying out the Canonical Cache, please email us! We'll send you an API key.
