kernel™ is the cloud-based infrastructure that makes real-time interactions viable.
Autoregressive pre-trained transformers and other large language model architectures are computationally expensive to run. This is partly why existing user interfaces for these applications are limited to a turn-based experience, in which the complete input, including text, images, and potentially audio, is provided before the computation begins. After a delay, the response is streamed back to the user as text, chunk by chunk, and the end of the stream marks the end of the interaction. This paradigm scales awkwardly to a long-running, voice-to-voice experience, where a human expects a response within roughly 250 ms on average and perceives a longer delay as unnatural. A further complication is that it is not immediately clear when the user has started or stopped speaking, what user input the core model should process and when it should start, when it should synthesize the response, and how to deliver that response back to the user in a natural way.
kernel™ is rabbit's cloud-based infrastructure that addresses this problem and makes large language model-based real-time interactions viable. We took inspiration from existing literature (such as speculative decoding, utterance segmentation, and pipelining) to design new techniques that allow fast, predictive, and natural delivery of multimedia content to users in natural language interactions. In these settings, a foundation model, which could be a text-only autoregressive pre-trained transformer or a multi-modal variant, is placed at the center of a long-running reconciliation loop. The loop constantly receives asynchronous updates from different services and the user, and synthesizes responses comprising voice, images, or text in a streaming fashion.
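
To make the idea of a reconciliation loop more concrete, here is a minimal Python sketch. It is not kernel™'s implementation, which this article does not describe in detail. The names `Update`, `LoopState`, `fake_model`, `reconcile`, and `demo` are hypothetical, and the model is replaced by a stand-in that simply echoes the latest context entry word by word. The sketch only shows the general shape: updates from services and the user arrive asynchronously on a queue, are folded into shared state, and a response is streamed out chunk by chunk.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class Update:
    source: str   # hypothetical label, e.g. "user" or "service"
    payload: str


@dataclass
class LoopState:
    context: list = field(default_factory=list)  # everything observed so far
    emitted: list = field(default_factory=list)  # chunks streamed to the user


async def fake_model(context):
    """Stand-in for the foundation model: yields response chunks one by one."""
    for word in f"reply to {context[-1]}".split():
        await asyncio.sleep(0)  # yield control, as a real streaming call would
        yield word


async def reconcile(inbox: asyncio.Queue, state: LoopState) -> None:
    """Long-running loop: fold in each update, then stream a response."""
    while True:
        update = await inbox.get()
        if update is None:  # sentinel used only to stop this sketch
            break
        state.context.append(f"{update.source}:{update.payload}")
        if update.source == "user":  # assumption: only user input triggers a reply
            async for chunk in fake_model(state.context):
                state.emitted.append(chunk)


async def demo() -> LoopState:
    inbox: asyncio.Queue = asyncio.Queue()
    state = LoopState()
    for item in (Update("service", "weather=sunny"), Update("user", "hello"), None):
        await inbox.put(item)
    await reconcile(inbox, state)
    return state
```

Running `asyncio.run(demo())` returns the final state, with both updates recorded in `context` and the streamed chunks in `emitted`. The sketch deliberately omits what a real-time system would need most: cancelling an in-flight response when the user interrupts, and starting work speculatively before the user has finished speaking.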