
Google has developed a method it calls audiovisual speech separation, which uses machine learning to isolate the speech of specific people in a video. The technology does more than separate spoken words from background noise and other sources: it can pick apart the overlapping speech of two or more people talking at once.
The details are posted on a Google Developers blog, which describes the technology and includes several videos demonstrating how it works.
The technology can also separate speech from background and ambient noise on a single audio track. Researchers believe the capability has a range of applications, from speech enhancement and recognition in videos and voice search to video conferencing and improved hearing aids.
The technique combines the auditory and visual signals of an input video to separate the speech. The visual signal not only improves separation quality when speech is mixed; more importantly, it associates each separated speech track with a visible speaker in the video.
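
To make the idea concrete, here is a minimal sketch, in PyTorch, of what such an audiovisual fusion model can look like. It is illustrative only, not Google's published architecture: the layer sizes, the face-embedding input, and the mask-based output are all assumptions.

    # A minimal sketch (not Google's actual model) of audiovisual fusion for
    # speech separation: a mixed-audio spectrogram and a per-speaker stream of
    # face embeddings are encoded separately, fused over time, and decoded into
    # one spectrogram mask per speaker. All layer sizes are illustrative.
    import torch
    import torch.nn as nn

    class AudioVisualSeparator(nn.Module):
        def __init__(self, freq_bins=257, face_dim=512, hidden=256, speakers=2):
            super().__init__()
            self.speakers = speakers
            # Audio stream: encode each spectrogram frame.
            self.audio_enc = nn.Linear(freq_bins, hidden)
            # Visual stream: encode each speaker's face-embedding frame.
            self.visual_enc = nn.Linear(face_dim, hidden)
            # Fusion: audio features plus one visual feature set per speaker.
            self.fusion = nn.LSTM(hidden * (1 + speakers), hidden,
                                  batch_first=True, bidirectional=True)
            # One soft mask per speaker over the input spectrogram.
            self.mask_head = nn.Linear(2 * hidden, freq_bins * speakers)

        def forward(self, spec, faces):
            # spec:  (batch, time, freq_bins)  mixed-speech magnitude spectrogram
            # faces: (batch, speakers, time, face_dim) per-speaker face embeddings
            a = torch.relu(self.audio_enc(spec))
            v = torch.relu(self.visual_enc(faces))            # (B, S, T, H)
            v = v.permute(0, 2, 1, 3).flatten(2)              # (B, T, S*H)
            fused, _ = self.fusion(torch.cat([a, v], dim=-1))
            masks = torch.sigmoid(self.mask_head(fused))      # (B, T, S*F)
            masks = masks.view(spec.shape[0], spec.shape[1], self.speakers, -1)
            # Apply each speaker's mask to the mixture to isolate their speech.
            return masks * spec.unsqueeze(2)

    if __name__ == "__main__":
        model = AudioVisualSeparator()
        spec = torch.rand(1, 100, 257)       # 100 frames of mixed audio
        faces = torch.rand(1, 2, 100, 512)   # face embeddings for 2 speakers
        print(model(spec, faces).shape)      # torch.Size([1, 100, 2, 257])

Each speaker's mask is multiplied against the mixed spectrogram, so the visual stream is what lets the network decide which parts of the audio belong to which on-screen face.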
The researchers explained that to train the model they used a large collection of 100,000 high-quality videos of lectures and talks from YouTube. From these videos they extracted segments of clean speech, removing any with music or other speakers. The result is about 2,000 hours of video clips, each showing a single person visible to the camera and talking with no background interference.
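
The blog does not publish the filtering code, but a pipeline of that kind could look roughly like the following sketch, where detect_faces and classify_audio are hypothetical stand-ins for a face detector and an audio classifier, not real Google APIs.

    # A hypothetical sketch of the kind of filtering the researchers describe:
    # scan each video in fixed windows and keep only segments where a single
    # face is on camera and the audio is clean speech (no music, no other
    # speakers). The detectors below are trivial stand-ins for illustration.
    from dataclasses import dataclass

    @dataclass
    class Segment:
        video_id: str
        start: float        # seconds
        end: float

    def detect_faces(video_id: str, t: float) -> int:
        """Stand-in face detector: number of faces visible at time t."""
        return 1

    def classify_audio(video_id: str, start: float, end: float) -> str:
        """Stand-in audio classifier: 'clean_speech', 'music', or 'noisy'."""
        return "clean_speech"

    def extract_clean_segments(video_id: str, duration: float,
                               window: float = 3.0) -> list[Segment]:
        """Slide a fixed window over the video, keeping clean one-speaker clips."""
        kept = []
        t = 0.0
        while t + window <= duration:
            one_face = detect_faces(video_id, t) == 1
            clean = classify_audio(video_id, t, t + window) == "clean_speech"
            if one_face and clean:
                kept.append(Segment(video_id, t, t + window))
            t += window
        return kept

    if __name__ == "__main__":
        clips = extract_clean_segments("lecture_001", duration=60.0)
        print(f"kept {len(clips)} clean 3-second clips")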
Google provides several videos in which the researchers separate the speech of two people talking at once, such as sports commentators speaking over each other or someone talking in a noisy cafeteria.
The technology can also be used for automatic video captioning.