The Future of Voice AI: How Standalone Companies Can Thrive in a Speech-to-Speech World

Will model providers like OpenAI and Google completely dominate the voice AI field? Not so fast.

Chelcie Taylor
July 2, 2025

When OpenAI unveiled its speech-to-speech model, the entire future of the voice AI market was called into question. Are we witnessing the beginning of the end for standalone voice AI companies?

There’s no doubt that voice leaped from trend to mainstay with explosive adoption, and voice agents are popping up everywhere from restaurant reservation lines to patient-call triage for healthcare providers. Gartner predicts that this surge will only continue, with generative AI powering 75% of new contact centers by 2028.

We're approaching an inflection point where voice AI becomes as common as chatbots, but with a crucial difference: voice collapses the boundary between software and the real world. Voice connects users to businesses, systems, and services that once relied entirely on human labor, enabling automation where it previously wasn’t possible.

The question is, will model providers like OpenAI and Google completely dominate the field? Or is there room for standalone voice AI companies to become key players in the enterprise?

In my view, the answer is more nuanced than winner-take-all. While speech-to-speech models represent a fundamental shift in how voice AI works, they don't spell doom for the voice infrastructure ecosystem (and standalone voice AI companies like ElevenLabs, Pipecat, Vapi, and Deepgram). Instead, they're forcing an evolution: one that will likely strengthen, not weaken, the case for specialized voice platforms.

The Great Architecture Debate

Before we debate what the future might hold, let’s review where we are now. Today, there are two approaches to building voice agents: the chained (cascaded) pipeline and end-to-end speech-to-speech models.

Why the Chained Approach is Currently Dominant

In the chained approach, speech gets transcribed to text, processed by an LLM, then converted back to speech. This modularity became the foundation of today's booming voice AI ecosystem for several reasons:

  • Leveraging existing breakthroughs: Teams could immediately capitalize on the LLM revolution by bolting speech-to-text and text-to-speech around existing language models, rather than waiting for purpose-built voice models.
  • Best-of-breed optimization: Developers could optimize each component independently, mixing and matching best-in-class tools for each layer (like Deepgram's real-time transcription, OpenAI's reasoning capabilities, and ElevenLabs' emotional synthesis), producing better overall results than any single end-to-end system.
  • Rapid experimentation: Developers could swap components easily to test different combinations, leading to faster iteration cycles and more innovative applications.
  • Cost control: Teams could choose cheaper models for simple tasks and premium models for complex reasoning, optimizing their cost structure.

This modular approach essentially allowed the entire voice AI ecosystem to build upon itself as AI capabilities advanced, and it gave developers the control to fine-tune each layer of the stack to meet their unique needs.
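To make that modularity concrete, here is a minimal sketch of a single chained turn in Python. Everything here is illustrative: `ChainedVoiceAgent` and the stand-in lambdas are hypothetical, not any vendor's SDK. In production, each stage would wrap a provider call (say, Deepgram for transcription, an OpenAI model for reasoning, ElevenLabs for synthesis), and the point is that each layer is an interchangeable function.

```python
from dataclasses import dataclass
from typing import Callable

# Each stage is just a function, so any vendor can fill any slot.
SttFn = Callable[[bytes], str]   # speech -> text
LlmFn = Callable[[str], str]     # text -> text (reasoning)
TtsFn = Callable[[str], bytes]   # text -> speech

@dataclass
class ChainedVoiceAgent:
    stt: SttFn
    llm: LlmFn
    tts: TtsFn

    def turn(self, audio_in: bytes) -> bytes:
        text = self.stt(audio_in)   # 1. transcribe the caller
        reply = self.llm(text)      # 2. decide what to say
        return self.tts(reply)      # 3. speak the reply

# Stand-in stages so the sketch runs on its own; in production each
# lambda would wrap a vendor SDK call.
agent = ChainedVoiceAgent(
    stt=lambda audio: "what time do you open tomorrow?",
    llm=lambda text: "We open at 9 a.m. Anything else I can help with?",
    tts=lambda text: text.encode("utf-8"),  # pretend the bytes are audio
)

print(agent.turn(b"<caller audio>"))
```

Swapping providers means swapping one function, which is exactly why this architecture enabled the rapid experimentation and cost control described above.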

Speech-to-Speech is Emerging—and Upending the Status Quo

In the speech-to-speech approach, audio is processed end-to-end, preserving tone, emotion, and conversational nuance without the intermediate text conversion. And the results are compelling: OpenAI's Advanced Voice Mode can laugh, whisper, and interrupt naturally in ways that feel genuinely human. Companies like Sesame are pushing even further, creating voices so realistic they're indistinguishable from human speakers.

The implications seem obvious: why build a complex chain of components when a single model can handle it? Speech-to-speech models promise lower latency (no conversion delays), better conversational flow, and simpler architecture.

The conventional wisdom says that consolidation is inevitable, that speech-to-speech will dominate because it's fundamentally more efficient and the stack will get more condensed. But this view misses a critical nuance: the model layer is just one piece of the enterprise voice AI puzzle.

Why Standalone Platforms Will Survive—and Thrive

The shift to speech-to-speech models doesn't crater the voice infrastructure field, but it does change what these companies need to build. Here's why:

Enterprise Needs Go Far Beyond the Model

While consumer applications might succeed with a direct API call to OpenAI or Google, enterprise voice deployments require a sophisticated layer of capabilities that model providers are unlikely to build:

  • Enterprise integrations: Voice agents must connect seamlessly with existing CRM, ERP, and workflow systems.
  • Compliance and governance: Healthcare and financial services need detailed audit trails, data residency controls, and regulatory compliance features.
  • Operational complexity: Converting existing call scripts into effective prompts, managing conversation flows, handling edge cases, and ongoing performance optimization—work that enterprises would rather outsource than build in-house.
  • Advanced analytics: Enterprises want granular insights into conversation quality, agent performance, and customer sentiment.
  • Fallback logic and reliability: Mission-critical applications need sophisticated error handling and human-in-the-loop handoff capabilities.

Taking a voice agent from demo to production involves a long tail of capabilities, from scripting and routing logic to exception handling and continuous optimization. And enterprises consistently tell us they want the flexibility to remain model-agnostic, not just to optimize costs (e.g., using cheaper models for routine tasks and premium ones for complex reasoning), but to retain control over sensitive data. Especially in regulated sectors, consolidating all voice data with a single provider can raise privacy concerns and increase vendor dependency.

The demand for optionality creates a natural moat for platforms that can orchestrate multiple models and abstract away the complexity of switching between them.
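As a rough illustration of what that orchestration layer does, here is a minimal Python sketch of model routing with failover. The providers, `ProviderError`, and the routing policy are hypothetical stand-ins, not any vendor's actual API; the idea is that the platform owns routing, retries, and escalation, so the application never has to care which model answered.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("voice-router")

class ProviderError(Exception):
    """Raised by a provider call when it times out or errors."""

def call_premium(prompt: str) -> str:
    # Hypothetical premium model; simulate an outage for the demo.
    raise ProviderError("premium model unavailable")

def call_budget(prompt: str) -> str:
    # Hypothetical cheaper model that is currently healthy.
    return f"[budget model] {prompt}"

def route(prompt: str, complex_task: bool) -> str:
    # Route hard reasoning to the premium model and routine turns to
    # the cheap one, with ordered fallback so a single vendor outage
    # never drops a live call.
    chain = [call_premium, call_budget] if complex_task else [call_budget, call_premium]
    for provider in chain:
        try:
            return provider(prompt)
        except ProviderError as err:
            log.warning("provider failed (%s); falling back", err)
    # Last resort mirrors the human-in-the-loop handoff named above.
    raise RuntimeError("all providers failed; escalate to a human agent")

print(route("Summarize this claim dispute for the caller.", complex_task=True))
```

Abstracting the switch behind one `route` call is the moat: customers get cost-aware routing and reliability guarantees without rewriting their application every time the underlying model market shifts.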

Economics Favor Specialization

The cost dynamics of voice AI are shifting dramatically in favor of specialized platforms. 

Google’s recently released Gemini Live API costs less than OpenAI’s Realtime API for real-time voice processing. While there are still challenges that make these models harder to deploy in production at scale (inconsistent reliability, a lack of robust context-management tools, and architectural inflexibility for enterprise applications), they serve as a clear signal: voice model pricing is going to drop fast.

But here's the counterintuitive part: as base model costs plummet, the value shifts to everything around the model. Fundamentally, enterprises are not paying for raw voice tech; they will pay for premium platforms that include:

  • Reliability guarantees: 99.9% uptime with sophisticated failover systems
  • Integration sophistication: Seamless connections to Salesforce, ServiceNow, and legacy systems
  • Advanced analytics: Granular insights that generic models can't provide
  • Compliance infrastructure: SOC 2, HIPAA, and industry-specific certifications

This creates a classic "commoditize your complement" scenario: spend less on the commodity (the model), charge more for the differentiation (the platform capabilities).

The Evolution Strategy

The most successful voice infrastructure companies aren’t resisting the speech-to-speech shift—they’re evolving into full-stack platforms by expanding in both directions. Some are going deeper into the model layer, while others are moving up the stack into verticalized applications. This reflects a broader platformization of the voice AI ecosystem, where defensibility increasingly comes from owning more of the end-to-end experience.

  • ElevenLabs, originally focused on text-to-speech, is now building out full conversational agents and multimodal creation tools. Its recently launched voice generation app lets users create, edit, and interact with voice content directly, positioning the company closer to creator tools and synthetic media platforms than infrastructure alone.
  • Vapi embraced verticalization by building pre-configured voice agents for specific industries like healthcare and financial services. Its EHR assistant, clinical decision support agent, and clinic triage workflow are more than API demos—they're early products that help customers accelerate time-to-value by embedding domain knowledge and integrations out of the box.
  • Deepgram continues to lead in real-time transcription, but has also moved into domain-specific models for use cases like medical transcription, launching Nova-2 Medical, which is optimized for clinical speech and HIPAA-aligned deployments. This move downward into specialized model development signals a tighter coupling between infra and application-specific performance.

The pattern is clear: successful voice infrastructure companies are expanding both up and down the stack while maintaining their core technical advantages.

What This Means for the Future

The rise of speech-to-speech models is undeniably reshaping the voice AI landscape, but it’s not the end of the voice infrastructure layer. It’s the evolution of the category. 

As the underlying models improve and commoditize, value is shifting to the companies that can make voice AI usable, trustworthy, and integrated at scale.

That’s why the winners in this next wave won’t just be model labs or API wrappers—they’ll be platforms that understand the messy realities of enterprise deployment. They’ll offer orchestration across models, deliver domain-specific solutions, and embed voice AI into the workflows that enterprises currently rely on. We’re already seeing early signs of this strategy playing out across healthcare, financial services, and other regulated industries.

This isn’t a race to consolidate the stack; it’s a race to abstract it. As I wrote in “Vertical Voice Agents Are Taking Off,” voice companies that specialize (by industry, by workflow, or by use case) gain an unfair advantage: faster GTM, better performance, and deeper customer lock-in.

We’re heading toward a hybrid future where model intelligence is table stakes, and operational execution is the differentiator. Voice AI won’t be won by the most human-like voice; it will be won by the most usable one.
