Cartesia AI specializes in building real-time multimodal intelligence, with a particular focus on voice AI for enterprise applications.
Unlike traditional cloud-hosted foundation models that operate in batch mode with delays of 500-600 milliseconds, interactive applications like voice and video demand immediate responses, where speed is paramount and quality is table stakes.
Cartesia aims to shift the paradigm of foundation models to enable real-time, multi-modal operations that can run anywhere, not just in the cloud but on any device.
For voice AI, even a one-second delay feels awkward; responses need to arrive within milliseconds to keep the conversation smooth and effective, especially in customer support.
Key challenges in voice AI include handling interruptions, globalization factors like accents and background noises, and the need for subjective customization.
Cartesia's voice AI solutions are built on three core principles: exquisite quality (naturalness of voice), low latency (hearing the first sound as soon as possible), and strong controllability (customizing the agent's voice to reflect brand identity).
They have pioneered a new architecture called state space models (SSMs) as an alternative to transformers, which typically scale quadratically with input length.
SSMs achieve O(1) per-token generation cost at inference time by maintaining a fixed-size state rather than attending over the full history, enabling latencies that traditional transformer architectures cannot match.
Cartesia's SSMs have closed the performance gap with transformers, performing better not only in latency but also in quality.
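The constant-cost recurrence behind that claim can be sketched in a few lines. The diagonal parameters below are toy values for illustration only, not Cartesia's actual model; the point is that each step touches only the fixed-size state, so per-token work does not grow with sequence length.

```python
# Minimal sketch of a linear state space model (SSM) recurrence:
#   x_t = A * x_{t-1} + B * u_t,   y_t = C * x_t
# Each step updates a fixed-size state in O(1) time, regardless of how
# many tokens came before -- unlike attention, which re-reads the history.

def ssm_step(state, u, A, B, C):
    """One recurrence step with diagonal A, B and a linear readout C."""
    new_state = [A[i] * state[i] + B[i] * u for i in range(len(state))]
    y = sum(C[i] * new_state[i] for i in range(len(new_state)))
    return new_state, y

def run_ssm(inputs, state_dim=4):
    # Toy parameters; a real SSM learns these from data.
    A = [0.9] * state_dim   # state decay
    B = [0.1] * state_dim   # input projection
    C = [1.0] * state_dim   # output readout
    state = [0.0] * state_dim
    outputs = []
    for u in inputs:
        state, y = ssm_step(state, u, A, B, C)
        outputs.append(y)
    return outputs

outs = run_ssm([1.0, 0.0, 0.0, 0.0])
```

An impulse input like this decays geometrically through the state, which is what lets an SSM carry context forward without storing past tokens.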
A primary challenge for customers building voice AI agents is the high latency of large language models (LLMs), requiring the voice generation component to be extremely fast to provide sufficient slack for the LLM to reason.
Controllability is a key feature, enabling high-quality voice cloning, accurate accent capture, and native inclusion of background noises, all of which make an agent sound more natural and help it avoid the "uncanny valley."
Voice AI has penetrated various markets, including healthcare, customer support, and real-time gaming, where it's used for dynamic non-player characters.
Cartesia supports human narrators and voice actors through a voice marketplace, aiming to amplify and license their unique essence and personality rather than replace them.
While Cartesia provides low latency on the text-to-speech (TTS) side, integrating with LLMs like Claude can still result in high end-to-end latency due to the LLM's processing time.
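One common way to hide that LLM processing time is to stream text into the TTS engine sentence by sentence instead of waiting for the full response. This is a generic sketch of that pattern, not Cartesia's documented API; the chunking rule and names here are assumptions.

```python
# Sketch: forward LLM output to TTS in sentence-sized chunks so audio
# can start playing while the LLM is still generating the rest.

SENTENCE_ENDINGS = (".", "!", "?")

def chunk_for_tts(token_stream):
    """Yield sentence-sized text chunks as soon as each one completes."""
    buffer = ""
    for token in token_stream:
        buffer += token
        if buffer.rstrip().endswith(SENTENCE_ENDINGS):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():  # flush any trailing partial sentence
        yield buffer.strip()

# Simulated LLM token stream; in practice this would be an API iterator.
chunks = list(chunk_for_tts(["Hello", " there", ".", " How", " can I help", "?"]))
```

The first audio can begin as soon as "Hello there." is complete, so perceived latency is bounded by the first sentence rather than the whole reply.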
AWS's design philosophy for its generative AI ecosystem emphasizes providing customers with optionality through model gardens like SageMaker Jumpstart and Amazon Bedrock, which host various models for specific use cases.
Achieving next-level voice AI models requires both large-scale pre-training data and very rich, diverse preference data for fine-tuning, as user preferences in audio are highly varied.
Speech-to-speech models are currently considered immature for production or enterprise-grade use cases; orchestrated pipelines offer more controllability today, even though speech-to-speech models may eventually win on latency.
Cartesia's models are designed to run locally on edge devices, offering unparalleled latency and being approximately five times faster than cloud roundtrips for certain applications.
When monitoring voice agents, issues are most often traced to the LLM stage, but Cartesia has made its system more robust by handling many edge cases in the text that LLMs hand off to TTS, including formatting quirks.
Cartesia's Sonic 2 model achieves a 40-millisecond model latency and has substantially improved quality.
By 2030, voice AI is predicted to become the "de facto norm" in every industry, covering interactions from triaging to full support and gaming.
The future of true interactive models extends beyond just what can be heard, evolving into "world models" that operate in real-time as assistants or co-pilots, helping humans understand the world in new ways.