OpenAI Says ChatGPT Can Now See, Hear, And Speak

The speech, voice, and vision features that OpenAI announced Monday for ChatGPT move the technology toward a conversational interface that advertising agencies and companies can build into their own platforms.

The voice capability, which supports the same kinds of interactions as artificial intelligence (AI) assistants like Apple Siri, Amazon Alexa, and Google Assistant, is powered by a text-to-speech model capable of generating human-like audio from text and a few seconds of sample speech.

This means users can talk to the chatbot and engage in a back-and-forth conversation. OpenAI collaborated with professional voice actors to create each of the voices, and uses Whisper, its open-source speech-recognition system, to transcribe spoken words into text.

Experts have raised concerns that AI-generated synthetic voices could enable convincing deepfakes and cyber threats, and OpenAI acknowledged those concerns today. The company said voice inputs from actors were used to create the voices.

The announcement did not provide information about how OpenAI would use consumer voice inputs or how the company would secure that data.

Spotify is using the technology behind the voice feature to let the platform's podcasters translate content into different languages, OpenAI said.

The vision-based models support image recognition. Working from an image, OpenAI’s technology can act much like an instructional video, for example by showing how to adjust a bicycle seat.

Image understanding is powered by multimodal GPT-3.5 and GPT-4. These models apply language reasoning skills to a range of images, such as photographs, screenshots, and documents containing both text and images.

What happens when voice and images are combined? Users can snap a picture of something and have a live conversation about it with the chatbot. They might ask what to make for dinner and then ask for a step-by-step recipe.

Voice and images in ChatGPT will roll out to Plus and Enterprise users over the next two weeks. Voice is coming to iOS and Android, and images will be available on all platforms.

But having vision also means the technology can now see what the user sees. OpenAI has been working with Be My Eyes, a free mobile app for the blind and people with limited vision, to understand the uses and the limitations of the technology.
