Computer scientists at MIT have developed a machine-learning system that can identify objects in an image based on a spoken description of the image.
Typical speech recognition systems like Google Voice and Siri rely on transcriptions of thousands of hours of speech recordings, which are then used to map speech signals to specific words.
Still in its early stages, the MIT system instead learns words directly from recorded speech clips and the objects in images, then links the two. So far it can recognize several hundred different words and objects, and the researchers expect future versions to scale well beyond that.
“We wanted to do speech recognition in a way that’s more natural, leveraging additional signals and information that humans have the benefit of using, but that machine learning algorithms don’t typically have access to,” stated David Harwath, a researcher in the Computer Science and Artificial Intelligence Laboratory (CSAIL) and the Spoken Language Systems Group. “We got the idea of training a model in a manner similar to walking a child through the world and narrating what you’re seeing.”
In a paper co-authored by Harwath, the researchers used an image of a young girl with blond hair and blue eyes, wearing a blue dress, with a lighthouse with a red roof in the background.
“The model learned to associate which pixels in the image corresponded with the words ‘girl,’ ‘blonde hair,’ ‘blue eyes,’ ‘blue dress,’ ‘white lighthouse’ and ‘red roof,’” according to the paper. “When an audio caption was narrated, the model then highlighted each of those objects in the image as they were described.”
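The highlighting described above can be pictured as comparing every image region with every moment of the audio caption. The following is only an illustrative sketch, not the authors' code: the embedding grid size, frame counts, and the random vectors standing in for learned features are all assumptions made for the example.

```python
import numpy as np

# Illustrative sketch (not the MIT model): embed image regions and audio
# frames in a shared space, then compare them. Random vectors stand in
# for learned embeddings; all sizes here are hypothetical.
rng = np.random.default_rng(0)

EMBED_DIM = 64
GRID = 7          # image represented as a 7x7 grid of region vectors
N_FRAMES = 100    # spoken caption represented as per-frame vectors

image_regions = rng.normal(size=(GRID, GRID, EMBED_DIM))
audio_frames = rng.normal(size=(N_FRAMES, EMBED_DIM))

# "Matchmap": similarity of every image region to every audio frame.
matchmap = np.einsum("hwd,td->hwt", image_regions, audio_frames)

# For the audio frames covering one spoken word (say, frames 20-35),
# highlight the image region that responds most strongly -- those are
# the pixels the model would light up as the word is narrated.
word_span = matchmap[:, :, 20:35].mean(axis=2)
best_region = np.unravel_index(np.argmax(word_span), word_span.shape)
print(best_region)  # (row, col) of the most responsive grid cell
```

With trained embeddings rather than random ones, the most responsive cell for the frames covering “girl” would sit on the girl's pixels, which is what lets the model highlight objects as they are described.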
Future uses of the machine-learning technology could include translation between languages.
“One promising application is learning translations between different languages, without need of a bilingual annotator,” according to MIT’s announcement of the work. “Of the estimated 7,000 languages spoken worldwide, only 100 or so have enough transcription data for speech recognition. Consider, however, a situation where two different-language speakers describe the same image. If the model learns speech signals from language A that correspond to objects in the image, and learns the signals in language B that correspond to those same objects, it could assume those two signals — and matching words — are translations of one another.”
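The translation idea above can be sketched in a few lines. This is a hedged illustration of the reasoning in the announcement, not the authors' method: the word lists, the activation vectors over image regions, and the noise level are all invented for the example.

```python
import numpy as np

# Sketch of the cross-lingual idea (hypothetical data): if a word in
# language A and a word in language B activate the same image regions,
# treat them as candidate translations -- no bilingual annotator needed.
rng = np.random.default_rng(1)
N_REGIONS = 49  # flattened 7x7 grid of image regions

# Hypothetical per-word activation profiles over image regions.
words_a = {"girl": rng.normal(size=N_REGIONS),
           "lighthouse": rng.normal(size=N_REGIONS)}
# For the demo, language B's words activate nearly the same regions as
# their language-A counterparts, plus a little noise.
words_b = {"fille": words_a["girl"] + 0.1 * rng.normal(size=N_REGIONS),
           "phare": words_a["lighthouse"] + 0.1 * rng.normal(size=N_REGIONS)}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Pair each language-A word with the language-B word whose image
# activations are most similar.
pairs = {a: max(words_b, key=lambda b: cosine(va, words_b[b]))
         for a, va in words_a.items()}
print(pairs)  # {'girl': 'fille', 'lighthouse': 'phare'}
```

The design choice the announcement hints at is that the shared image acts as the pivot: neither language ever needs a transcription, only descriptions of the same scenes.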