
Apr 2, 2025

Building technologies to expand the future of sound for creators

The International Conference on Acoustics, Speech, and Signal Processing (ICASSP) will hold its 50th edition starting on April 6, 2025. Yuki Mitsufuji, a Distinguished Engineer at Sony Group, has contributed to numerous research papers accepted at ICASSP, as well as at leading AI conferences such as ICLR and CVPR. We sat down with him to talk about the evolution of sound separation technology, his current work, and the challenges he aims to tackle in the future.

▶ Click here for a past article about sound separation:
“AI Sound Separation - Reviving the Sound of Classic Movies with AI”

  • Yuki Mitsufuji

    Distinguished Engineer
    Sony Group Corporation
    Sony Research Inc.

Overcoming the Challenges of Sound Separation and Expanding its Applications

──What did you study at university?

I majored in computer science and engineering, a field closely related to what we call machine learning today. My research theme was music. At the time, I was actively engaged in music, composing songs and performing at live venues, which naturally led me to pursue this topic. The first technology I worked on was automatic transcription, which involves extracting musical notation from audio signals.
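
As a rough illustration of what that involves, here is a minimal Python sketch of one building block of automatic transcription: estimating the pitch of a short audio frame and mapping it to a note number. This is a toy example with an arbitrary test signal, not the method from that research.

```python
import numpy as np

SR = 44100  # sample rate in Hz

def estimate_midi_note(frame: np.ndarray, sr: int = SR) -> int:
    """Crude transcription step: take the strongest FFT peak as the pitch."""
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    f0 = freqs[np.argmax(spectrum[1:]) + 1]  # skip the DC bin
    return int(round(69 + 12 * np.log2(f0 / 440.0)))  # Hz -> MIDI note number

# A 440 Hz sine (concert A) should map to MIDI note 69.
t = np.arange(2048) / SR
print(estimate_midi_note(np.sin(2 * np.pi * 440.0 * t)))  # 69
```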

Later, I decided to join Sony to take on the challenge of developing technology at a company involved in the music industry.

──What kind of work did you take on after joining Sony? Were there any turning points in your career?

After joining Sony, I was assigned to sound technology development for audio products, including DSEE and LDAC technologies, but I always had a strong desire to be directly involved in the entertainment industry. One experience that deepened that desire for me was studying abroad in France.

In 2011, I had the opportunity to conduct research at IRCAM (Institut de Recherche et Coordination Acoustique/Musique), the French National Institute for Music and Acoustics. This institute is not only a place where researchers are trained but also a hub for musicians, fostering a collaborative environment where both groups contribute to the advancement of music.

While conducting research on sound at IRCAM, I came to realize that sound separation was a major bottleneck in effectively utilizing music or dialogue. Speech recognition and song analysis are extremely challenging because the signal to be processed is a mixture of several signals. This has been an obstacle for a variety of advanced signal processing applications.

For example, immersive content was beginning to gain popularity at that time. In music, achieving an immersive experience requires precise control over the volume and positioning of different elements, such as vocals, piano, drums, and guitar. However, sound separation technology was not very precise at that time, making it incredibly difficult to isolate vocals or individual instruments from a mixed recording source.
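
To see why separation is the missing piece here: once the individual stems exist, an immersive remix reduces to giving each element its own gain and spatial position. A minimal Python sketch, with silent placeholder arrays standing in for real separated stems:

```python
import numpy as np

def pan_stereo(mono: np.ndarray, pan: float) -> np.ndarray:
    """Constant-power pan: pan=-1 is hard left, 0 center, +1 hard right."""
    theta = (pan + 1.0) * np.pi / 4.0  # map [-1, 1] onto [0, pi/2]
    return np.stack([np.cos(theta) * mono, np.sin(theta) * mono], axis=-1)

def remix(stems, gains, pans):
    """Recombine separated stems into stereo with per-stem volume/position."""
    out = None
    for name, audio in stems.items():
        s = pan_stereo(gains.get(name, 1.0) * audio, pans.get(name, 0.0))
        out = s if out is None else out + s
    return out

sr = 44100
stems = {name: np.zeros(sr) for name in ("vocals", "piano", "drums", "guitar")}
mix = remix(stems,
            gains={"vocals": 1.2, "drums": 0.8},
            pans={"piano": -0.5, "guitar": 0.5})
print(mix.shape)  # (44100, 2)
```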

I realized that if we could significantly improve sound separation technology, it could make a major contribution to the entertainment industry, so I began pursuing it in earnest. That realization set me on the path I continue to follow today.

Early Adoption of Deep Learning

──How has sound separation technology evolved over the years?

Sound separation has been a subject of research since the 1990s, initially focusing on how to replicate the cocktail party effect*1 using machine learning technology. From the 2000s to the early 2010s, Nonnegative Matrix Factorization (NMF) emerged as a breakthrough method, dramatically enhancing the potential of sound separation technology.
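
The idea behind NMF-based separation can be sketched briefly: a nonnegative magnitude spectrogram V is factorized as V ≈ WH, where the columns of W are spectral templates and the rows of H are their activations over time; a source is then rebuilt from a subset of components. A minimal scikit-learn sketch with a random placeholder spectrogram, purely for illustration:

```python
import numpy as np
from sklearn.decomposition import NMF

# Placeholder magnitude spectrogram (frequency bins x time frames),
# standing in for |STFT| of a real mixture.
rng = np.random.default_rng(0)
V = np.abs(rng.normal(size=(513, 200)))

# Factorize V ~= W @ H.
model = NMF(n_components=8, init="nndsvda", max_iter=400, random_state=0)
W = model.fit_transform(V)  # (513, 8) spectral templates
H = model.components_       # (8, 200) time activations

# Rebuild one "source" from a subset of components (here, the first four),
# then derive a soft mask to apply to the complex mixture before inverse STFT.
V_source = W[:, :4] @ H[:4, :]
mask = V_source / np.maximum(W @ H, 1e-10)
print(mask.shape)  # (513, 200)
```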

In my view, a turning point in this research area was the keynote speech by Geoffrey Everest Hinton at ICASSP 2013. He is a pioneer who laid the foundations of the deep learning that underpins what we now call AI. In his speech, he demonstrated that deep learning could dramatically enhance the accuracy of image and speech recognition compared to traditional methods.

Amid a crowd so large that the venue was overflowing, I listened to his talk and felt a strong conviction: "This is going to be big." Of course, many others in the audience must have shared this anticipation. However, I believe we were among the first to take action, and we quickly began applying deep learning to sound separation. This early adoption is one of the key reasons why Sony continues to maintain a competitive edge in music separation technology today.

Following this realization, we redesigned our sound separation technology using deep learning and entered an international competition called the SiSEC (Signal Separation Evaluation Campaign). At the time, we were the only ones implementing deep learning, which made us somewhat anxious. However, we ended up securing an overwhelming first-place victory. As other participants later integrated deep learning into their own systems, the gap between us gradually narrowed. Still, we managed to win first place for three consecutive years. Furthermore, we took the initiative to organize major competitions, such as the Music Demixing Challenge 2021 and Sound Demixing Challenge 2023, further solidifying Sony’s reputation as a leader in sound separation technology within the research community.

──How did the technology fare in terms of practical application?

When professional production engineers tested our technology, their initial feedback was that it was nowhere near ready for commercial use. Even though our technology had achieved top recognition in academia, there were still significant hurdles to making it viable for real-world applications.

One of the major breakthroughs we achieved after much trial and error was the production of "Richard Strauss: Enoch Arden", a collaboration between actor Kanji Ishimaru and the late pianist Glenn Gould. In the original recordings, Gould's piano was mixed together with the spoken-word narration. Using our sound separation technology, we successfully isolated the narration, enabling a new Japanese narration by Ishimaru to be seamlessly integrated and transforming the piece into a completely new work.

We also achieved major successes in the film industry. For instance, in the 4K UHD restorations of "Lawrence of Arabia" and "Gandhi", we used our technology to extract the original audio elements. The sound mixers at Sony Pictures Entertainment then remapped the extracted sounds into a Dolby Atmos format, successfully creating an immersive sound field that faithfully reproduced the original atmosphere.

Through sound separation, I feel like I’ve finally been able to bridge the gap between academic research and real-world entertainment applications, fulfilling my long-held dream of contributing to the entertainment industry.

That said, challenges remain, particularly with classical music and older films, where the audio was not recorded separately or is highly complex. However, considering that speech recognition technology has now reached levels that were thought impossible back around 2011, I'm confident that these challenges will be overcome in the near future.

Tackling Entertainment Challenges Through the Realm of Sound

──In addition to sound separation, you have also been working on other AI-driven technologies. Could you tell us more about them?

Since around 2020, when generative AI began to gain attention, we have focused on developing our own generative models, that is, creating our own tools rather than relying on black-box technologies*2.

I believe that relying on black-box technology can ultimately slow down development. For instance, when using diffusion models, if the output isn’t as expected, only the original creators of the model can fully understand why it isn’t working. Without a thorough understanding of how the tools function, we would inevitably fall behind in the field. That’s why we made a deliberate effort to focus on mastering and developing our own models, rather than just using pre-existing tools.

This approach has gradually started to pay off, and we are now seeing tangible results.

The key advantage of the diffusion model our team developed is its “high-speed generation.” Originally, we developed this technology for image generation, but we have now successfully applied it to sound. We believe this speed gives us a strong competitive advantage in the industry. Diffusion models typically require dozens of sampling steps; by reducing that number, we have achieved significant acceleration*3. Moreover, leveraging this high-speed processing capability, we are also working on real-time generation*4. For example, while a 10-second audio segment is playing, the next 10 seconds are being generated simultaneously. Traditionally, creating diverse sounds has been a time-consuming process. However, with AI interaction, people will soon be able to generate sounds in real time. The paper we submitted on this research was accepted at ICLR 2025, which will be held next month.
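
The scheduling idea behind that real-time pipeline is simple to sketch: as long as generating a chunk takes less time than playing it back, a producer thread can stay ahead of playback through a small buffer. The following Python sketch uses a stand-in generator and scaled-down timings (the chunk length and generation cost here are assumptions for the demo, not measurements of the actual system):

```python
import queue
import threading
import time

CHUNK_SECONDS = 1.0  # scaled down from the 10 s chunks described above
GEN_SECONDS = 0.2    # assumed generation cost; must stay below CHUNK_SECONDS

def generate_chunk(index: int) -> bytes:
    """Stand-in for a fast, few-step diffusion sampler producing one chunk."""
    time.sleep(GEN_SECONDS)                 # pretend to run the sampler
    return f"audio-chunk-{index}".encode()  # placeholder for PCM samples

def producer(buf: queue.Queue, n_chunks: int) -> None:
    for i in range(n_chunks):
        buf.put(generate_chunk(i))  # blocks when the lookahead buffer is full

buf: queue.Queue = queue.Queue(maxsize=2)  # small lookahead buffer
threading.Thread(target=producer, args=(buf, 5), daemon=True).start()

for _ in range(5):
    chunk = buf.get()          # the next chunk is ready before playback ends
    print("playing", chunk)
    time.sleep(CHUNK_SECONDS)  # stand-in for actual audio playback
```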

Mastering the core mechanisms of diffusion models has also enabled us to develop technologies that serve as countermeasures against potential AI threats.

In an ideal world, generative AI would create new content by learning from appropriately sourced data. In reality, however, there is a phenomenon called memorization*5, in which generated outputs inadvertently contain portions of the original training data, posing a significant risk to the creators of that content. To address this issue, we are developing technologies to detect unauthorized learning, as well as persistent watermarking methods that remain embedded even after AI training, ensuring long-term protection of original data.
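
One simple way to screen for memorization, sketched below for illustration (this is not the method of the referenced paper), is to embed both generated outputs and training clips with the same encoder and flag any generation whose nearest training neighbor is suspiciously similar. The embeddings and threshold here are random placeholders:

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between rows of a and rows of b."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

# Placeholder embeddings (e.g. from an audio encoder): one row per clip.
rng = np.random.default_rng(0)
train_emb = rng.normal(size=(1000, 128))  # training set
gen_emb = rng.normal(size=(16, 128))      # generated outputs

sims = cosine_sim(gen_emb, train_emb)  # (16, 1000)
nearest = sims.max(axis=1)             # closest training clip per generation
THRESHOLD = 0.95                       # assumed; tuned on known duplicates
flagged = np.nonzero(nearest > THRESHOLD)[0]
print("possible memorization in clips:", flagged.tolist())
```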

──What are your future aspirations?

The challenges around generative AI will increasingly affect various creative industries. What I want to focus on is building a system that ensures creators are fairly compensated and that their livelihoods are not threatened, even as new technologies continue to emerge. I hope to address the issues that creative fields are likely to face in the future from the sound technology perspective.

By creating an environment where creators can fully express their talents and by developing technologies that are embraced by them, I believe we can enrich the world with more captivating content, allowing more people to experience new and innovative sound and visuals.

Notes

*1 Cocktail Party Effect: The phenomenon where a person can focus on and distinguish specific sounds or speech in a noisy environment
*2 Black-box technology: A technology that users can utilize and obtain results from without understanding its underlying principles or structure
*3 Reference paper: Consistency Trajectory Models: Learning Probability Flow ODE Trajectory of Diffusion
*4 Reference paper: SoundCTM: Unifying Score-based and Consistency Models for Full-band Text-to-Sound Generation
*5 Reference paper: Classifier-Free Guidance inside the Attraction Basin May Cause Memorization (This paper was also accepted for CVPR 2025)

