Semantic Cache: Reduce LLM Latency and Costs

Drive down costs and latency with our context-aware LLM cache for conversational AI.

Generate an API key for a free two-week trial!

cache.py
import httpx
import openai

# Point the OpenAI client at the Canonical AI cache and pass your API key
# in the X-Canonical-Api-Key header.
client = openai.Client(
    base_url='https://cacheapp.canonical.chat/',
    http_client=httpx.Client(
        headers={
            "X-Canonical-Api-Key": "<your-api-key>"
        }
    )
)
completion = client.chat.completions.create(...)

How Does Semantic Caching Work?

For each user query, first search the cache semantically for a previously answered query that is essentially the same, even if it is phrased differently. If a match is found, return the cached response instead of calling the LLM. A cache hit responds faster and costs less than an LLM call.
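
Conceptually, the flow looks something like the minimal sketch below. This is an illustration only, not Canonical AI's implementation: the in-memory list, the toy embed function, and the 0.9 similarity threshold are placeholders for a real embedding model and vector store.

import math

SIMILARITY_THRESHOLD = 0.9                   # tune for your precision/recall needs
_cache: list[tuple[list[float], str]] = []   # (query embedding, cached response)

def embed(text: str) -> list[float]:
    # Toy character-frequency embedding, only to make the sketch runnable.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

def answer(query: str, call_llm) -> str:
    query_vec = embed(query)
    # 1. Semantic lookup: find the most similar cached query.
    best = max(_cache, key=lambda entry: cosine(query_vec, entry[0]), default=None)
    if best and cosine(query_vec, best[0]) >= SIMILARITY_THRESHOLD:
        return best[1]                       # cache hit: skip the LLM call
    # 2. Cache miss: call the LLM, then populate the cache for next time.
    response = call_llm(query)
    _cache.append((query_vec, response))
    return response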

Semantic Cache Demo

In the first conversation in the demo above, the user asks new questions, the LLM responds, and the cache gets populated.

In the second conversation (after the terminal is cleared), the user asks the same questions, but with different phrasing. The responses are returned from the semantic cache and the time to first token is ~10x faster.

Features

  • Context-Aware Semantic Caching. High-precision caching for conversational AI (e.g., Voice AI agents). Get cache hits only when it's contextually appropriate.
  • High Recall Semantic Caching. Get cache hits even in open-ended AI conversations.
  • Fast Semantic Caching. Response times are ~50 ms for on-prem deployments and ~120 ms over the network.
  • Secure Semantic Caching. Queries with PII are not cached, so user data stays safe.
  • Multitenancy. Each product, each AI persona, or each user can have its own cache. You decide the scope of the cache.
  • Tunable Cache Temperature. You decide whether you want a cache hit to return the same response or differently phrased responses.
  • Simple Integration. Deploy our LLM Cache one step upstream of your LLM call. If there’s a cache hit, don’t call your LLM. If there’s a cache miss, update the semantic cache with the LLM completion after you’ve responded to the user (see the sketch after this list).
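
A minimal sketch of that request flow is below, assuming your stack exposes some way to query and update the cache. The names query_cache, update_cache, call_llm, and send_to_user are hypothetical placeholders for your own cache client and existing LLM call, not Canonical AI API functions.

def handle_user_message(message: str, conversation_id: str) -> str:
    # query_cache / update_cache / call_llm / send_to_user are hypothetical
    # placeholders for your cache client and your existing LLM plumbing.
    cached = query_cache(conversation_id, message)
    if cached is not None:
        return cached                                     # cache hit: skip the LLM entirely

    completion = call_llm(conversation_id, message)       # cache miss: call your LLM as usual
    send_to_user(completion)                              # respond to the user first...
    update_cache(conversation_id, message, completion)    # ...then populate the cache
    return completion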

Pricing

We charge only for cache hits. On a cache hit, we charge 50% of the per-token price of your LLM model.

For example, let's say an LLM API call would have cost you $0.02 from the LLM provider. When Canonical AI returns the response from the cache instead, you pay us $0.01 rather than paying $0.02 to the LLM provider.

Each month, we sum the number of tokens processed on cache hits, multiply that total by 50% of the per-token API price for the LLM model you're using, and email you an invoice.
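
In code, the billing arithmetic works out to something like the sketch below (the token count and price are made-up example numbers, not a quote).

cache_hit_tokens = 2_000_000         # tokens served from the cache this month (example)
llm_price_per_token = 0.02 / 1000    # e.g., $0.02 per 1K tokens from your LLM provider

invoice = cache_hit_tokens * llm_price_per_token * 0.5   # 50% of the LLM's per-token price
print(f"Monthly invoice: ${invoice:.2f}")                # -> Monthly invoice: $20.00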

Don't Delay On Dropping Latency

Generate an API key using the link above to try out the Canonical AI cache for yourself!

If you'd like to discuss how conversational AI caching can help reduce your LLM app's latency and cost, please email us! We'd love to hear from you!
