OpenAI GPT-4o Gets Real-Time Conversational: Sees, Hears, Analyzes

Conversational models that can see, hear and analyze have become a reality. OpenAI on Monday announced a new GPT model that brings smarter, faster real-time voice and visual interactions. The technology could eventually displace traditional search.

GPT-4o offers another level of intelligence to users, OpenAI CTO Mira Murati said during a livestream presentation. The "o" stands for "omni," referring to the model's ability to handle text, speech, and video. 

The company demonstrated real-time interactions with the voice assistant, including faster responses, real-time conversations, visual interactions, and the ability to interrupt the AI assistant.

OpenAI research lead Mark Chen interacted with ChatGPT to demonstrate real-time conversational speech.

“I’m onstage right now doing a live demo and frankly,” he said, “I’m feeling a little nervous.”

When Chen asked the voice assistant to help him calm his nerves, ChatGPT told him to “take a deep breath and remember, you’re the expert.”

“I like that suggestion,” he said, asking for feedback on his breathing before starting to take quick, deep breaths that sounded like someone hyperventilating.

“Slow down there, Mark. You sound like a vacuum cleaner. Breathe in for four and exhale slowly,” ChatGPT said, between comments from Chen.

The key to this demonstration -- aside from the chatbot recognizing the unnaturally rapid rhythm of Chen's breathing -- was his ability to interrupt ChatGPT mid-response and have it pick up on the change in conversation, creating a genuine real-time voice exchange.

Microsoft also announced Monday that it has integrated OpenAI's latest GPT-4o model into Azure AI. 

OpenAI said this new model can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is similar to human response time in a conversation. It matches GPT-4 Turbo performance on text in English and code, with significant improvement on text in non-English languages, while also being much faster and 50% cheaper in the API. 
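
The speed and cost claims are most relevant to developers calling the model through the API. As a rough illustration only, here is a minimal sketch of a text request to GPT-4o using OpenAI's Python SDK; the prompt, the system message, and the environment-variable setup are illustrative assumptions rather than details from the announcement.

    # Minimal sketch of a text request to GPT-4o, assuming the OpenAI Python SDK
    # (`pip install openai`) and an OPENAI_API_KEY environment variable are set up.
    from openai import OpenAI

    client = OpenAI()  # picks up OPENAI_API_KEY from the environment

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a concise assistant."},
            {"role": "user", "content": "In one sentence, what does 'omni' mean in GPT-4o?"},
        ],
    )
    print(response.choices[0].message.content)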

OpenAI's previous most advanced model, GPT-4 Turbo, was trained on a combination of images and text and could analyze images and text to accomplish tasks like extracting text from images or even describing the content of those images. But GPT-4o adds speech.
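
Image understanding of the kind GPT-4 Turbo offered is exposed through the same chat interface. The sketch below, again assuming the OpenAI Python SDK, sends a text prompt and an image URL in a single request; the placeholder URL and the extraction prompt are assumptions for illustration.

    # Sketch: text plus image in one GPT-4o request via the OpenAI Python SDK.
    # The image URL below is a placeholder, not a real resource.
    from openai import OpenAI

    client = OpenAI()

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Extract any text you can read in this image."},
                    {"type": "image_url", "image_url": {"url": "https://example.com/receipt.png"}},
                ],
            }
        ],
    )
    print(response.choices[0].message.content)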

Another demonstration showed ChatGPT reading an AI-generated story in different voices, including dramatic and robotic tones. It also sang.

A third demo had the model look at and analyze an algebra equation, helping the presenter work through it rather than simply providing the answer.

And just when you thought it couldn't get any creepier, an OpenAI video demonstrated how two chatbots can hold a conversation with each other.

Developers at OpenAI have been working on these capabilities for some time. In September 2023, OpenAI gave ChatGPT the ability to speak, hear and see, but it was in the early stages of development.

The voice capability, powered by a text-to-speech model that generates human-like audio from text and a few seconds of sample speech, supported interactions similar to those of assistants such as Apple's Siri, Amazon's Alexa, and Google Assistant.
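
OpenAI has not published the internals of that text-to-speech model, but its hosted speech endpoint illustrates the basic workflow. The sketch below assumes the OpenAI Python SDK; the model name, voice, input text, and output path are illustrative assumptions.

    # Sketch: synthesize spoken audio from text with OpenAI's hosted TTS endpoint.
    # Model "tts-1" and voice "alloy" are assumptions for illustration.
    from pathlib import Path
    from openai import OpenAI

    client = OpenAI()

    response = client.audio.speech.create(
        model="tts-1",
        voice="alloy",
        input="Take a deep breath and remember, you're the expert.",
    )
    response.stream_to_file(Path("assistant_reply.mp3"))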

Talking to the chatbot lets people engage in a back-and-forth conversation. OpenAI collaborated with professional voice actors to create each of the voices, and used Whisper -- its open-source speech-recognition system -- to transcribe spoken words into text.
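
Because Whisper is open source, the transcription step is easy to reproduce locally. A minimal sketch, assuming the openai-whisper Python package and ffmpeg are installed; the model size and audio file name are placeholders.

    # Sketch: local speech-to-text with the open-source Whisper package.
    # "question.mp3" is a placeholder audio file.
    import whisper

    model = whisper.load_model("base")          # small multilingual model
    result = model.transcribe("question.mp3")   # returns a dict including the text
    print(result["text"])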

Apple is rumored to be building OpenAI's AI technology into iOS 18, the next version of its iPhone operating system. 

Wedbush analyst Dan Ives wrote in a note Monday that he believes "it's a done deal," though the conjecture is based on media reports. Google is also said to be vying for that coveted partnership.

Whichever company Apple chooses, the partnership will likely be formally announced at Apple's WWDC developer conference on June 10. Analysts expect it to involve Apple devices, high-end chips, and the cloud infrastructure that powers large language models (LLMs). 
