Voice agents have been costly to run and painful to orchestrate, not because the models can't handle conversation, but because context ceilings forced enterprises to build session resets, state compression, and reconstruction layers into every deployment. OpenAI's three new voice models are designed to reduce that overhead, and they change how engineers can think about building voice into a larger agent stack.
GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper integrate real-time audio into the model management stack as discrete orchestration primitives, separating conversational reasoning, translation, and transcription into specialized components rather than bundling them in a single voice product.
The company said in a blog post that Realtime-2 is its first voice model "with GPT-5 class reasoning" and can handle difficult requests while keeping conversations flowing naturally. Realtime-Translate understands more than 70 languages and translates them into 13 others at the speaker's pace, and Realtime-Whisper is its new speech-to-text transcription model.
These three tasks no longer sit inside a single stack or model. GPT-Realtime-2 could technically handle transcription, but OpenAI is routing distinct tasks to specialized models: Realtime-Translate for multilingual speech and Realtime-Whisper for transcription. Enterprises can assign each task to the appropriate model rather than routing everything through a single, all-encompassing voice system.
The new OpenAI models compete against Mistral's Voxtral models, which also separate transcription and target enterprise use cases.
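In practice, this split means an orchestration layer can dispatch each voice task to its own model. A minimal sketch of that dispatch logic is below; the model identifier strings are assumptions based on the announced product names, not confirmed API values.

```python
# Hypothetical sketch: routing discrete voice tasks to specialized models.
# The model identifiers are illustrative, derived from the announced names.

TASK_MODEL_MAP = {
    "conversation": "gpt-realtime-2",         # conversational reasoning
    "translation": "gpt-realtime-translate",  # multilingual speech
    "transcription": "gpt-realtime-whisper",  # speech-to-text
}

def route_voice_task(task: str) -> str:
    """Return the specialized model assigned to a given voice task."""
    try:
        return TASK_MODEL_MAP[task]
    except KeyError:
        raise ValueError(f"Unknown voice task: {task!r}")
```

The point of the mapping is architectural: transcription traffic never touches the reasoning model, so each workload can be priced, scaled, and monitored on its own.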
What enterprises should do
More enterprises are seeing the value of voice agents now that more people are becoming comfortable conversing with an AI agent, and also because of the richness of data from voice customer interactions.
Organizations evaluating these models will need to consider their orchestration architecture, not just model quality: specifically, whether their stack can route discrete voice tasks to specialized models and manage state across a 128K-token context window.
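Managing state against that window reduces to a budget check: does the accumulated session history, plus the next turn, still fit? A minimal sketch under assumed numbers follows; the reserve size is illustrative, and a real deployment would count tokens with the provider's tokenizer.

```python
# Hypothetical sketch: deciding whether a voice session still fits in a
# 128K-token context window before triggering compression or a reset.
# The reserve value is an assumption, not a documented requirement.

CONTEXT_WINDOW = 128_000
RESPONSE_RESERVE = 4_000  # tokens held back for the model's reply

def fits_in_context(history_tokens: int, new_turn_tokens: int) -> bool:
    """True if the session can continue without compression or reset."""
    return history_tokens + new_turn_tokens + RESPONSE_RESERVE <= CONTEXT_WINDOW
```

A stack that can answer this question per turn can defer the session resets and state-compression layers the article describes until they are actually needed, rather than building them in preemptively.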

