
Reviving the Sound of Classic Movies with AI

AI Sound Separation

Nov 18, 2020

Separating individual sounds from a mixed audio source

Sound separation technology that has leaped forward dramatically with AI

Sound separation is a technology that makes it possible to extract individual sounds from a mixed audio source. This was originally considered incredibly difficult to accomplish, but in 2013 we incorporated Sony’s AI technology, which allowed us to dramatically improve performance. Results have already been achieved in, for example, reviving classic movies, eliminating noise on smartphones, and enabling real-time karaoke for music streaming services, and we expect the technology to be applied to an even wider variety of fields in the future.

  • Yuki Mitsufuji

    Tokyo Laboratory 21
    R&D Center
    Sony Corporation

  • Stefan Uhlich

    Stuttgart Laboratory 1
    R&D Center
    Sony Corporation

Recreating human abilities using machines

──What kind of technology is AI sound separation?

Yuki Mitsufuji:AI sound separation is a technology that makes it possible to remove unnecessary noise from audio data and to extract just the vocals or other specific instruments. When humans listen to a performance where multiple sounds are mixed together, we are able to distinguish individual instruments, and we can naturally focus on a single voice in a conversation even when surrounded by a large crowd. These abilities are unique to humans, and until recently they were extremely difficult to reproduce with computers. Some people have compared the task to mixing two juices and then trying to extract one of them. However, in the last few years, the technology has improved dramatically thanks to the introduction of new AI methods.

Stefan Uhlich:Previously, people tried to build a lot of domain knowledge into the separation, for example knowledge about the mixing process. Furthermore, simpler models were preferred because they could be studied theoretically. This has now changed, as it works much better to learn the separation system from data using AI.

Three examples of our sound separation applied to Lawrence of Arabia, where we show how we can extract the dialogue as well as various foley sounds

──How is AI used?

Mitsufuji:Our sound separation is carried out by AI and we can teach our computers to fulfill this task. For example, a guitar has a specific sound or frequency that is learned by a neural network. Regardless of how many sounds are mixed, our AI system is capable of identifying these characteristics. It is just like how we can spot an apple when we see one because we have seen many of them before. AI is applied to sound separation in much the same way, both mechanically and conceptually.

Uhlich:The neural network learns to identify the audio characteristics during a process called training. In training, the network sees a lot of music, more than we will ever hear in our lifetime, together with the target sound that it should extract. This information is sufficient for the neural network to learn the sound separation.
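The idea of training on pairs of a mixture and its target sound can be illustrated with a very small sketch. This is not the system described in the interview: instead of a neural network it learns a single gain per frequency bin, and the data is synthetic. The names `make_pair`, `train_gains`, and `separate`, the four-bin spectrum, and the assumption that the target occupies the low bins are all hypothetical choices made for illustration.

```python
import random

N_BINS = 4  # toy spectrum size; an assumption made for this sketch


def make_pair(rng):
    """Return one synthetic (mixture, target) pair of magnitude spectra."""
    # Toy assumption: the target is loud in the low bins,
    # the interfering sound is loud in the high bins.
    target = [rng.uniform(0.5, 1.0) if k < N_BINS // 2 else rng.uniform(0.0, 0.1)
              for k in range(N_BINS)]
    other = [rng.uniform(0.0, 0.1) if k < N_BINS // 2 else rng.uniform(0.5, 1.0)
             for k in range(N_BINS)]
    mixture = [t + o for t, o in zip(target, other)]
    return mixture, target


def train_gains(pairs, epochs=200, lr=0.05):
    """Fit one gain per bin by stochastic gradient descent on squared error."""
    gains = [0.5] * N_BINS
    for _ in range(epochs):
        for mixture, target in pairs:
            for k in range(N_BINS):
                error = gains[k] * mixture[k] - target[k]
                gains[k] -= lr * error * mixture[k]  # gradient of 0.5 * error**2
    return gains


def separate(gains, mixture):
    """Apply the learned gains to a mixture to estimate the target."""
    return [g * m for g, m in zip(gains, mixture)]


rng = random.Random(0)
training_pairs = [make_pair(rng) for _ in range(50)]
gains = train_gains(training_pairs)

# Evaluate on a pair that was not used for training.
test_mixture, test_target = make_pair(rng)
estimate = separate(gains, test_mixture)
print("learned gains:", [round(g, 2) for g in gains])
```

After training, the gains for the bins where the target dominates end up close to 1 and the others close to 0, so the model has learned from examples alone which parts of the mixture to keep. A neural network plays the same role with a far more flexible model and far more data.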

Rewinding time and remixing recordings

──What is special about AI sound separation technology?

Mitsufuji:We think this is one of the few technologies that can rewind time. For example, you can take a recording from the past where the parts had to be recorded all together and specifically extract the vocals to remix them, or separate all the instruments to recombine them into a new format.

──We heard that it is also being used for movies.

Uhlich:To provide an immersive sound field to people watching movies, it is necessary to deliver sounds from a number of different angles and recreate a 3D audio space. However, classic movies have the dialogue and sound effects on the same track, so there has been a limit to what could be extracted and to how immersive the sound field could be made. We started wondering whether we could extend our technology to movies, and after learning from a sound effect (Foley) library, our AI system was able to successfully extract individual sound effects from the master copy. As you can also see in the video above, for the 4K UHD versions of Lawrence of Arabia and Gandhi released in the U.S., Sony Pictures Entertainment sound mixers took sounds extracted with this technology and remastered them using Dolby Atmos to create an immersive sound field.

Illustration showing the foley sound separation process and its usage for movie upmixing

The 4K UHD versions of Lawrence of Arabia and Gandhi included in the Columbia Classics Collection Vol 1.

Bringing the value of sound separation to more people

──Looks like it can be used in a variety of other fields too.

Mitsufuji:This technology is also expected to find non-movie applications, such as cleaning up human voices recorded through microphones. For example, Sony’s autonomous robotic "puppy", aibo, can respond to human voices and communicate, but if aibo simply gathers the surrounding sounds, noises such as aibo’s own mechanical sounds or wind noise will also be picked up. By using AI sound separation to extract human voices and remove all other background sounds, we have been able to improve its voice recognition capabilities. Similarly, by cleaning up just the human voice during phone calls on Xperia™ smartphones, we have made it possible to chat without worrying about wind noise. Another recent example is a "karaoke mode" for a music streaming app. Sound separation removes the vocals in real time from the music being streamed and allows the user's voice to be mixed with the remaining sound, which makes a karaoke-like experience possible.
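The karaoke mode described above can be pictured as a simple per-block signal flow: take the streamed music, remove the separated vocals, and add the microphone signal. The sketch below is only an illustration under stated assumptions. It assumes the vocal estimate already comes from some separation model (not shown), that removal is a plain sample-wise subtraction, and that audio arrives as blocks of mono samples between -1.0 and 1.0. The function names `karaoke_block` and `karaoke_stream` and the `mic_gain` parameter are hypothetical and do not describe the actual app.

```python
def karaoke_block(mixture, estimated_vocals, mic, mic_gain=1.0):
    """Process one block of mono samples in the range [-1.0, 1.0]."""
    out = []
    for m, v, u in zip(mixture, estimated_vocals, mic):
        accompaniment = m - v                   # remove the separated vocals
        mixed = accompaniment + mic_gain * u    # add the user's voice
        out.append(max(-1.0, min(1.0, mixed)))  # clip to the valid range
    return out


def karaoke_stream(blocks, mic_gain=1.0):
    """Yield output blocks one at a time, as a streaming player would."""
    for mixture, estimated_vocals, mic in blocks:
        yield karaoke_block(mixture, estimated_vocals, mic, mic_gain)


# Two tiny hand-written blocks: (music, estimated vocals, microphone).
blocks = [
    ([0.5, 0.2], [0.3, 0.1], [0.1, 0.0]),
    ([0.9, -0.4], [0.0, 0.0], [0.5, 0.0]),
]
output = list(karaoke_stream(blocks))
```

Processing block by block, rather than the whole song at once, is what lets the result keep up with the stream. How well it works in practice depends almost entirely on the quality of the vocal estimate.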

──What are the future possibilities and prospects for this technology?

Mitsufuji:We hope that our technology will be like a time machine that allows artists from the past and present to collaborate across time. Sony PCL and Sony Music Solutions are just starting to offer services externally using our technology, so there is definitely more to come. I am very much looking forward to it.

Uhlich:From a technological point of view, we will see a transition to universal source separation, where not only the number of sources is unknown but the source types are also unspecified. This is recognized as a challenging but interesting scenario that will enable even more commercial use cases.
