Sakana AI Unveils KAME for Real-Time Smarter Voice Interactions
Sakana AI, a Tokyo-based research lab, has introduced a new system called KAME that aims to make voice conversations more natural and intelligent. This new architecture tries to solve a long-standing problem in speech AI: how to respond quickly while also being smart and informed. It combines the speed of direct speech-to-speech models with the knowledge depth of large language models in real time.
The Challenge of Fast and Smart Voice AI
Traditional voice assistants face a tough choice. Some respond very fast, often even before a person finishes asking a question, but their answers tend to be shallow or generic. These models, like Moshi, process audio directly and generate responses almost instantly, but because they prioritize speed they leave little room for complex reasoning or detailed knowledge.
On the other hand, systems that route speech through a speech recognition step, then to a large language model (LLM), and back to speech, produce more accurate and knowledgeable answers. But they take longer, typically around two seconds, because the pipeline must wait for the user to finish speaking before it can respond, which makes conversations feel less natural and more robotic.
KAME’s Innovative Tandem Architecture
KAME introduces a hybrid system that works with two parts running at the same time. The first part is based on models like Moshi, which process audio and generate speech very quickly. It starts responding immediately, even as the user is still talking. The second part involves a speech-to-text module connected to a large language model that listens to the ongoing speech and gradually builds a transcript.
The key feature of KAME is its "oracle" stream. As the user speaks, the speech-to-text system sends partial transcripts to the LLM, which generates tentative responses, called oracles, that are sent back to the speech system. These oracles are like educated guesses that improve as more of the speech is processed. This way, the system can adjust its response mid-sentence, making the conversation flow more naturally.
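The oracle loop described above can be sketched in a few lines of Python. This is purely illustrative: the function names (`transcribe_chunk`, `draft_oracle`) and the use of asyncio are assumptions for the sketch, not Sakana AI's actual API, and the STT and LLM components are replaced with stubs.

```python
import asyncio

# Illustrative sketch of KAME's oracle stream. All names here are
# hypothetical stand-ins, not Sakana AI's real interfaces.

async def transcribe_chunk(chunk: str, transcript: list[str]) -> str:
    """Stub STT: append the new audio chunk's words to the running transcript."""
    transcript.append(chunk)
    return " ".join(transcript)

async def draft_oracle(partial_transcript: str) -> str:
    """Stub LLM: return a tentative answer conditioned on what has been heard so far."""
    return f"draft answer given: '{partial_transcript}'"

async def oracle_stream(audio_chunks: list[str]) -> list[str]:
    """For each incoming chunk, extend the transcript and refresh the oracle.
    In the real system, the fast speech model would keep talking and consume
    the latest oracle as it arrives, revising its response mid-sentence."""
    transcript: list[str] = []
    oracles: list[str] = []
    for chunk in audio_chunks:
        partial = await transcribe_chunk(chunk, transcript)
        oracle = await draft_oracle(partial)  # tentative response, refined each step
        oracles.append(oracle)
    return oracles

oracles = asyncio.run(oracle_stream(["what is", "the capital", "of france"]))
for o in oracles:
    print(o)
```

The point of the sketch is the shape of the loop: each oracle is produced from an incomplete transcript, so earlier oracles are rougher guesses that later ones supersede.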
Training and Performance of KAME
One challenge is that no existing dataset contains these oracle signals, so Sakana AI researchers created a method called Simulated Oracle Augmentation. They used a simulator LLM and a standard dataset to generate synthetic responses that mimic real-time responses at different levels of completeness. These were used to train KAME, helping it learn how to handle partial information effectively.
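The core idea of Simulated Oracle Augmentation, generating oracle signals at different levels of transcript completeness, can be sketched as follows. The function names and the fixed completeness fractions are assumptions for illustration, and the simulator LLM is replaced with a stub; the actual method and dataset details are Sakana AI's.

```python
# Hypothetical sketch of Simulated Oracle Augmentation: truncate each full
# transcript at several completeness levels and pair each prefix with a
# simulator-generated tentative response, yielding training examples that
# teach the model to act on partial information.

def truncate(transcript: str, fraction: float) -> str:
    """Keep roughly the first `fraction` of the transcript's words."""
    words = transcript.split()
    k = max(1, int(len(words) * fraction))
    return " ".join(words[:k])

def simulate_oracle(prefix: str) -> str:
    """Stub for the simulator LLM producing a tentative response."""
    return f"tentative answer for: '{prefix}'"

def augment(transcript: str, fractions=(0.25, 0.5, 0.75, 1.0)) -> list[tuple[str, str]]:
    """Build (partial transcript, simulated oracle) training pairs."""
    pairs = []
    for f in fractions:
        prefix = truncate(transcript, f)
        pairs.append((prefix, simulate_oracle(prefix)))
    return pairs

pairs = augment("what year did the first moon landing happen")
for prefix, oracle in pairs:
    print(f"{prefix!r} -> {oracle!r}")
```

Each pair exposes the model to a different stage of an utterance, which is what lets it learn to handle the partial transcripts it will see at inference time.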
Tests show KAME performs very well. When evaluated on a speech-based question-and-answer benchmark, it scored significantly higher than traditional models, approaching the quality of cascaded systems but with almost no delay. For example, KAME with GPT-4.1 as the backend scored over three times higher than Moshi, all while maintaining near-instant responses. Although it doesn’t quite match the top cascaded systems in raw accuracy, its real-time responsiveness offers a big step forward for natural voice AI.
Overall, KAME represents a new approach to making voice interactions smarter without sacrificing speed. Its ability to think while speaking could lead to more natural and helpful voice assistants in the future. Sakana AI continues to refine the system, promising exciting developments in speech AI technology.