Introduction
LLM Latency
The old Natural Language Processing approaches to voice AI fell short of user expectations. With the arrival of LLMs, voice and multimodal AI will eat the world.
But first, we have to address latency. In voice conversation, users expect a response from the AI in 300 milliseconds or less. Yet in applications with a sufficiently detailed system prompt, processing the context window drags latency out to seconds – and that’s before conversation turns add to the context window.
Semantic Caching
Canonical AI is an optimization engine for conversational AI. An ultra-low-latency semantic cache is the core of our technology. For the semantic cache to work in a conversation, we make the cache aware of the conversational context.
Here is a basic description of a simple semantic cache. For each new user query, the code first searches the cache semantically for what is essentially the same query – even if the phrasing is different. If a match is found, the code returns the cached answer rather than calling the LLM. Cache hits return responses faster and cost less than calling a foundational LLM API.
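To make that concrete, here is a minimal sketch in Python. The `embed` function is a stand-in for a real sentence-embedding model (it is stubbed with a toy hash so the example runs standalone), and the 0.9 similarity threshold is an assumed parameter, not a Canonical AI setting.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Toy stand-in for a real embedding model: character hashing into a
    # fixed-size vector, normalized so cosine similarity is a dot product.
    vec = np.zeros(64)
    for i, ch in enumerate(text.lower()):
        vec[(ord(ch) * (i + 1)) % 64] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

class SemanticCache:
    def __init__(self, threshold: float = 0.9):
        self.threshold = threshold  # minimum cosine similarity for a hit
        self.entries: list[tuple[np.ndarray, str]] = []  # (embedding, response)

    def get(self, query: str) -> str | None:
        q = embed(query)
        for vec, response in self.entries:
            if float(np.dot(q, vec)) >= self.threshold:
                return response  # cache hit: skip the LLM call
        return None  # cache miss: caller falls through to the LLM

    def put(self, query: str, response: str) -> None:
        self.entries.append((embed(query), response))

cache = SemanticCache()
cache.put("What are your business hours?", "We're open 9am to 5pm, Monday through Friday.")
print(cache.get("What are your business hours?"))  # hit
```

With a real embedding model, a paraphrase like "When are you open?" would also land above the threshold; a production cache would also swap the linear scan for a vector index.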
For conversational AI, the cache needs to know the context of the user query. For example, a user may ask a question about something at the beginning of a conversation, then ask the very same question about a different matter later in the conversation. Without awareness of the context, a semantic cache will store the first response and use it to incorrectly answer the user’s second question.
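One simple way to add that awareness – an illustrative assumption, not necessarily how the Canonical Cache does it – is to fold the most recent turns into the text that gets embedded, so the same question asked in different contexts maps to different cache keys:

```python
def contextual_key(query: str, history: list[str], window: int = 2) -> str:
    # Prepend the last few turns so "How much does it cost?" asked about
    # two different products embeds to two different points in vector space.
    return " | ".join(history[-window:] + [query])

# Used with the SemanticCache sketch above:
#   response = cache.get(contextual_key(query, history))
#   cache.put(contextual_key(query, history), llm_response)
```

The window size is a trade-off: too small and context-sensitive questions collide; too large and harmless rephrasings stop hitting the cache.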
To address the contextual requirements of conversational AI caching, a newly deployed Canonical Cache observes before acting. In observation mode, the Canonical Cache detects the parts of the conversation that can be cached and the parts that should not be cached. Each time it detects a caching opportunity, it adds the corresponding key-value pair to the cache. Other technologies, such as cache seeding, also build context for the Canonical Cache.
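The actual detection logic is Canonical AI’s own; purely as a sketch of the observe-then-serve lifecycle, a cache might promote a query only after the LLM has answered it consistently several times, on the assumption that a stable answer signals a context-insensitive turn:

```python
from collections import defaultdict

class ObservingCache:
    # Sketch of an observe-then-serve lifecycle, reusing the SemanticCache
    # above. The promotion rule is a stand-in heuristic, not Canonical AI's
    # detection method.
    def __init__(self, promote_after: int = 3):
        self.promote_after = promote_after
        self.observations = defaultdict(list)  # key -> LLM responses seen
        self.serving = SemanticCache()

    def record(self, key: str, llm_response: str) -> None:
        seen = self.observations[key]
        seen.append(llm_response)
        # Promote only if the answer has been stable across repeated
        # sightings, i.e. the turn does not look context-sensitive.
        if len(seen) >= self.promote_after and len(set(seen)) == 1:
            self.serving.put(key, llm_response)

    def lookup(self, key: str) -> str | None:
        return self.serving.get(key)
```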