How developers can bring voice AI into telephony applications
In the era of support apps and chatbots, telephony continues to hold strong as the backbone of customer communication, and voice AI is entering the call center scene to further streamline customer interaction.
However, this means developers are suddenly being confronted with a whole new set of challenges, foremost among them the difficulty of bridging the gap between layers of AI and “legacy” telecom networks. In fact, as large language models constantly evolve and update, the voice AI pipeline must be designed from the outset for easy switching. With much uncertainty surrounding the shift, one thing is clear: It’s crucial not to underestimate the challenges latent in AI-telephony integration.
Voice AI agents have a multitude of enterprise use cases. They are a valuable tool for setting customer appointments, then rescheduling and canceling them as needed. Moreover, they serve to triage inbound calls, before routing them correctly to human agents. Voice AI can even shoulder the responsibility of organizing ETAs, coordinating deliveries, and scheduling candidates for interviews.
Businesses should assume from the start that they will want to change components of the voice AI pipeline and pick accordingly, focusing on systems that give them flexibility. That said, further problems are continuing to present themselves to developers.
Why telephony is still hard for developers
People often assume that a voice AI agent is simply ChatGPT with a voice: an LLM wired up to receive and route calls. This is far from reality. Voice AI agents require a whole infrastructure, containing multiple components that flesh out the LLM so it can operate successfully in the real world.
- Large language models (LLMs): The cornerstone of any AI call system, they interpret intent, plan steps, and generate responses, all of which enable seamless comms between caller and agent.
- Speech-to-text (STT): The input channel of the system, converting caller audio to text; without it, neither the LLM nor call analytics can operate.
- Text-to-speech (TTS): The counterpart and inverse of STT, synthesizing the agent’s response and making it sound like natural speech.
- Turn-taking: How does an AI stay conversational? Voice activity detection and barge-in policies decide when the agent speaks and when it yields the floor, keeping the exchange natural.
- Telephony gateway: This bridging device converts PSTN/SIP/WebRTC and manages signaling and media.
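The components above can be sketched as a minimal call loop. Everything here is a hypothetical stand-in for illustration: a real system would stream audio through the telephony gateway to vendor STT, LLM, and TTS services.

```python
# Minimal sketch of one conversational turn: STT -> LLM -> TTS.
# All three components are stubbed stand-ins, not real vendor clients.

def speech_to_text(audio: bytes) -> str:
    """STT: convert caller audio to text (stubbed for illustration)."""
    return audio.decode("utf-8")  # pretend the audio is already a transcript

def generate_reply(transcript: str) -> str:
    """LLM: interpret intent and plan a response (stubbed)."""
    if "appointment" in transcript.lower():
        return "Sure, I can help you schedule an appointment."
    return "Could you tell me a bit more about what you need?"

def text_to_speech(text: str) -> bytes:
    """TTS: synthesize the agent's reply as audio (stubbed)."""
    return text.encode("utf-8")

def handle_turn(caller_audio: bytes) -> bytes:
    """One full turn of the pipeline, from caller audio to agent audio."""
    transcript = speech_to_text(caller_audio)
    reply = generate_reply(transcript)
    return text_to_speech(reply)
```

In production each of these stages streams rather than returning a complete value, but the shape of the loop stays the same.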
These pieces fit together in a complex network of telephony infrastructure, albeit one with some limitations. Local telecom carriers must reckon with these, in addition to their business’s own compliance needs, requirements, and constraints. To this end, communications networks always comprise a mix of vendors and technologies, meaning that enterprises need to stay flexible as they integrate new components with existing elements.
This is especially true for voice AI applications, which have some of the most stringent technical requirements. Application developers should aim to coordinate voice AI-specific elements while interoperating with existing systems.
The technical reality check
Developers face a set of gritty technical problems when integrating voice AI into telecom networks. Moving forward with building a voice AI agent—one that really works in production—means unpacking these issues and building solid solutions.
Managing latency
Latency is a niggling issue that threatens any good voice AI system. Gaps and pauses before hearing a response are a red flag for callers: The user may conclude that the agent either isn’t there or that the tech isn’t working properly.
The International Telecommunication Union (ITU) recommends a mouth-to-ear latency of less than 400 milliseconds to maintain a natural conversation. “Mouth-to-ear” refers to the time between words leaving the speaker’s lips and reaching the listener’s ear. Humans then usually take a couple of hundred milliseconds to begin responding. To mimic human interaction, an AI system must therefore produce a response within a tight window, and that response must make the return trip through the network before the original talker hears it. All told, the whole interaction needs to take around a second, otherwise it will start to feel off. In reality, most voice AI systems are only on the cusp of meeting this target, though new models and better techniques are closing the gap.
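The arithmetic behind that one-second target can be laid out as a simple budget. The per-component numbers below are illustrative assumptions, not vendor specifications; the point is that the stages must fit together under the round-trip ceiling.

```python
# Back-of-the-envelope latency budget for one AI response turn.
# Component timings are illustrative assumptions, not measured values.
BUDGET_MS = 1000  # total round trip before the pause starts to feel off

components_ms = {
    "telephony transport (caller -> platform)": 150,
    "speech-to-text (final transcript)": 200,
    "LLM time-to-first-token": 300,
    "text-to-speech time-to-first-audio": 150,
    "telephony transport (platform -> caller)": 150,
}

total = sum(components_ms.values())
headroom = BUDGET_MS - total

for name, ms in components_ms.items():
    print(f"{name:45s} {ms:4d} ms")
print(f"{'total':45s} {total:4d} ms (headroom: {headroom} ms)")
```

With these assumed numbers the pipeline lands at 950 ms, leaving only 50 ms of slack, which is why every stage has to stream rather than run sequentially on complete inputs.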
Latency can make or break effective real-time AI systems. We’ve seen this with cases of latency coupled with missing language support in health care. A startup based in Australia, for example, wanted to use an AI caller to check on elderly Cantonese-speaking patients. This would seem to be a good use of the technology. However, high latencies to US-based voice AI infrastructure, plus a lack of Cantonese TTS, made the experience unnatural.
Solutions to latency problems are mostly engineering trade-offs: You cut latency wherever you can during the development phase. That requires real-time flows, end-to-end. Stream in and out concurrently, rather than waiting for the LLM to produce its full text output before passing it to the TTS to be synthesized.
Keeping a close eye on long delays during calls is also key, so that a response can be injected when necessary and pauses or silences kept to a minimum. Another aspect of the solution is holding a steady stream of communication with the user: Rather than letting the line go silent and leading callers to suspect something is wrong, make a point of telling them that a delay is coming up. Background sounds can similarly reassure callers that their query is being handled despite any pauses.
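The delay-watchdog idea above can be sketched with a timeout around the pipeline: if no reply arrives within a threshold, inject a holding phrase so the line never goes silent. Here `get_reply` is a hypothetical stand-in for the real STT-LLM-TTS pipeline, and the 300 ms threshold is an assumed tuning value.

```python
import asyncio

FILLER_AFTER_S = 0.3  # assumed threshold: inject a filler after 300 ms of silence

async def get_reply(delay_s: float) -> str:
    """Hypothetical stand-in for the STT -> LLM -> TTS pipeline."""
    await asyncio.sleep(delay_s)  # simulated pipeline latency
    return "Your appointment is confirmed for Tuesday."

async def respond(delay_s: float) -> list[str]:
    """Return everything the caller hears, filler phrases included."""
    spoken: list[str] = []
    task = asyncio.create_task(get_reply(delay_s))
    try:
        # shield() keeps the pipeline running even if the timeout fires
        reply = await asyncio.wait_for(asyncio.shield(task), FILLER_AFTER_S)
    except asyncio.TimeoutError:
        spoken.append("One moment while I look that up.")  # filler phrase
        reply = await task  # then wait for the real answer
    spoken.append(reply)
    return spoken

fast = asyncio.run(respond(0.05))  # quick reply: no filler needed
slow = asyncio.run(respond(0.6))   # slow reply: filler first, then the answer
```

The same pattern extends naturally to repeating fillers or background audio for multi-second tool calls.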
Impersonal AI
Another problem for voice AI lies in its potential to sound monotonous and impersonal, leaving callers with the feeling they were dialed through to some homogenous machine. Third-party TTS systems address exactly this: Expanding voice options and bringing more variety to the service helps retain a human touch.
It’s a mark of the field’s diversity that solutions in voice AI telephony take many forms. Streaming TTS can lower latency, while some vendors offer a wide variety of voices, letting you pick one that suits your business and needs. Some companies already have a voice identified with their brand, meaning they can clone that voice and feed it into their voice AI system; having a distinctive voice speak directly to customers through telephony can be a powerful asset. Others can select from a range of voices to find one that aligns well with their brand.
Integrating with telephony systems
One further issue is integrating your AI agent with existing telephony systems, particularly the contact center and enterprise infrastructure. These are themselves often made up of a blend of systems from a mix of vendors; while the SIP standard governs most traditional telephony, that is no guarantee of interoperability. Indeed, older systems are often fixed or limited in their settings, meaning that new systems must be highly adaptable.
In this context, it makes sense to pick an experienced vendor, one that knows how to interoperate across a variety of environments and systems. Also make sure they have solid debugging tools and the support needed to work through any unexpected issues that might crop up.
Network quality can vary wildly between countries, particularly in rapidly evolving regions like Latin America. For example, we have seen unreliable SIP interconnections from Mexico, with customers forced to route through the US, adding unnecessary latency. Conversely, major investments in Brazil’s infrastructure in recent years have improved service not only within the country but also across the larger region. Ideally, your CPaaS (communications platform as a service) provider will have carrier relationships across many countries, allowing them to optimize traffic in all situations.
Five tips for building real-time voice AI that works
So, to summarize the above, I’ve pulled together five tips on how to build a real-time voice AI that actually works.
- Start by defining the needs and constraints of the user: latency tolerance, supported languages and geographies, and other factors like KPIs and compliance scope.
- Choose your comms integration and media path carefully. Specifically, think about where you stand in terms of voice versus messaging. If you go down the voice road, figure out what your architecture will look like, particularly around CPaaS, trunks, transfers, and DTMF (dual tone multi-frequency) signaling.
- No voice AI is complete without a solid, compatible real-time AI pipeline. Start by picking an LLM: The underlying model powers the behavior of your voice system, influencing latency, compliance, tone, and much more. Having clarity on voices and pipelines from the start will help businesses craft an effective voice AI.
- Deep integration with existing systems is another piece of the puzzle, allowing the tech to surface important information and context about the caller, such as names and account details. Unnatural memory lapses from the bot are a serious non-starter. A well-integrated system helps avoid common pitfalls (latency, missing barge-in, or hallucinations) and makes your voice AI feel alive.
- Productionization is mission-critical to all telephony applications: to call centers, to real-time gaming and trading systems, and to the voice agent you’ve built with the goal of running flawlessly on every phone call. Properly built infrastructure lets you manage word error rate, latency, and autoscaling.
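Productionization starts with measuring. As one illustrative sketch (the SLO threshold and percentile method here are assumptions, not a prescribed standard), per-call response latencies can be tracked and checked against a budget:

```python
# Sketch of per-call production metrics: track response latency and flag
# calls that breach the latency budget. The 1000 ms SLO is an assumed value.
LATENCY_SLO_MS = 1000

class CallMetrics:
    def __init__(self) -> None:
        self.latencies_ms: list[float] = []

    def record(self, ms: float) -> None:
        """Record the response latency of one conversational turn."""
        self.latencies_ms.append(ms)

    def p95(self) -> float:
        """95th-percentile latency via a simple sorted-index estimate."""
        ordered = sorted(self.latencies_ms)
        idx = max(0, int(0.95 * len(ordered)) - 1)
        return ordered[idx]

    def slo_breaches(self) -> int:
        """Count turns that exceeded the latency budget."""
        return sum(1 for ms in self.latencies_ms if ms > LATENCY_SLO_MS)

metrics = CallMetrics()
for ms in range(100, 2001, 100):  # 20 simulated turns, 100 ms to 2000 ms
    metrics.record(ms)
```

The same pattern applies to word error rate or concurrency counts; the point is that these numbers feed alerting and autoscaling decisions rather than sitting in a dashboard.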
Voice AI agents are constantly evolving, representing an iterative tech with a unique set of challenges. I’ll conclude with some tips for future-proofing your voice AI and telecom stack against this backdrop of evolution.
What’s next for real-time voice AI
One key piece of advice is to get ahead of the curve on LLM and speech vendors. Assume that these aren’t static components, but that you’ll want to swap them in order to move with the times. Don’t put yourself on the back foot, but make sure it’s possible to mix and match on your platform.
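One way to keep that mix-and-match possible is to code against a narrow interface rather than a vendor SDK. The sketch below uses a Python `Protocol` for this; both vendor classes are hypothetical stand-ins, not real providers.

```python
from typing import Protocol

class LLMProvider(Protocol):
    """The narrow interface the agent depends on, vendor-agnostic."""
    def complete(self, prompt: str) -> str: ...

class VendorA:
    """Hypothetical vendor client; a real one would wrap an SDK call."""
    def complete(self, prompt: str) -> str:
        return f"[vendor-a] {prompt}"

class VendorB:
    """A second hypothetical vendor, interchangeable with the first."""
    def complete(self, prompt: str) -> str:
        return f"[vendor-b] {prompt}"

class VoiceAgent:
    """Depends only on LLMProvider, never on a concrete vendor class."""
    def __init__(self, llm: LLMProvider) -> None:
        self.llm = llm

    def reply(self, transcript: str) -> str:
        return self.llm.complete(transcript)

# Swapping vendors becomes a one-line change, not a rewrite.
agent = VoiceAgent(VendorA())
agent = VoiceAgent(VendorB())
```

The same seam works for STT and TTS providers, and it is what makes A/B-testing a new model against the incumbent cheap.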
More broadly, avoid being caught out by evolutions in the tech. By anticipating quality and performance improvements in speech and AI, rather than being overtaken by them, you’ll be able to quickly mobilize improvements when they emerge. Even if you’re reaping the benefits of a certain approach today, don’t hold on for too long, or else a better strategy that’s coming out tomorrow will pass you by.
It’s also worth mentioning that the global reach of voice AI is both a challenge and an advantage. In the San Francisco Bay Area, a significant portion of voice AI orchestration platforms primarily target US users. That’s all well and good, but companies with more internationalized customer bases have the upper hand because they have already confronted challenges that more localized companies have yet to experience.
For example, latency is a major challenge internationally, where voice AI data centers may be farther away (or US-only) and telecom carriers may be less reliable. This gives international providers the edge: Their global footprint brings solid carrier relationships and extensive voice AI partnerships.
Ultimately, it will only be a matter of years before the new generation of voice applications is much-improved over what we see today. In fact, the integration may be so seamless that it will be hard to tell the difference between AI agents and human agents in state-of-the-art systems. This should accelerate call centers in replacing their legacy IVR (interactive voice response) systems with voice AI. So too should it drive developers and stakeholders to build AI-driven call workflows fit for real-world use.
—
New Tech Forum provides a venue for technology leaders—including vendors and other outside contributors—to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to doug_dineley@foundryco.com.
Original Link:https://www.infoworld.com/article/4136039/how-developers-can-bring-voice-ai-into-telephony-applications.html
Originally Posted: Tue, 10 Mar 2026 09:00:00 +0000