Apr 22, 2025
Sony Shares New Research at International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2025)
Sony researchers shared new work in Hyderabad, India at the 50th annual IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2025). Signal processing research is essential for advancing the state of the art across audiovisual creative media. Sony joined as a Platinum Sponsor, contributing 13 academic research papers and a tutorial featuring insights from Sony researchers and academic partners worldwide. Sony team members and collaborators connected with participants about breakthroughs in the field to date as well as important challenges still to come.
Transforming chaos into harmony: a tutorial on diffusion models in audio signal processing
Pictured left to right: Chieh-Hsin Lai (Sony AI), Cong Bac Nguyen (Sony AI), Koichi Saito (Sony AI), Yuhki Mitsufuji (Sony AI)
Researchers from Sony AI hosted a tutorial on diffusion model research, sharing insights about new methodologies and use cases within audio signal processing. The researchers explored how techniques originally developed for computer vision tasks can be transposed into audio media restoration and controllable audio generation. After a live demo of diffusion model training and application, the team discussed emerging trends that are shaping the future of the field.
Pictured left to right: Cong Bac Nguyen (Sony AI), Chieh-Hsin Lai (Sony AI), Koichi Saito (Sony AI), [Speaker] Yuhki Mitsufuji (Sony AI)
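For readers new to the topic, the core idea behind the tutorial can be pictured with a minimal training step: a network learns to predict the noise that was added to clean audio (here, mel-spectrogram frames), so that at inference time it can remove noise step by step to restore or generate audio. The sketch below is a simplified DDPM-style illustration, not code from the tutorial; the tiny network, tensor shapes, and linear noise schedule are placeholder assumptions.

import torch
import torch.nn as nn

# Placeholder denoiser; in practice this would be a U-Net or transformer over spectrograms.
denoiser = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 80))

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # linear noise schedule (assumption)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)   # cumulative signal-retention factors

def training_step(clean_mel, optimizer):
    """One DDPM-style step: corrupt audio at a random timestep, predict the added noise."""
    b = clean_mel.shape[0]
    t = torch.randint(0, T, (b,))
    noise = torch.randn_like(clean_mel)
    a_bar = alphas_bar[t].view(b, 1)             # broadcast over the feature dimension
    noisy = a_bar.sqrt() * clean_mel + (1 - a_bar).sqrt() * noise
    loss = ((denoiser(noisy) - noise) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage: one step on a random batch of 8 mel frames with 80 bins each.
opt = torch.optim.Adam(denoiser.parameters(), lr=1e-4)
print(training_step(torch.randn(8, 80), opt))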
Lecture series: surveying 30+ years of source separation research and looking to the future of audio
Sony researchers welcomed ICASSP attendees to presentations throughout the week based on featured publications. Sony AI shared a historical look at source separation (SS) research and the evolution of Music Information Retrieval (MIR), including the important role of benchmarking and open research.
Pictured: Mathias Bjare (Johannes Kepler University Linz)
Mathias Bjare of Sony Computer Science Laboratories - Paris (Sony CSL - Paris) / Johannes Kepler University Linz spoke about modeling musical surprisal in audio. Musical surprisal describes how strongly an event in a musical sequence deviates from the listener’s expectation. Bjare presented a general model for estimating musical surprisal in full-length musical audio through information content (IC). In this work, Lattner’s team successfully modeled human musical surprisal, showing that IC predicts human EEG (electroencephalographic) brain responses to songs.
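The quantity at the heart of this work can be stated compactly: the information content of an event is the negative log-probability an autoregressive model assigns to it given the preceding context, IC(x_t) = -log p(x_t | x_<t). The snippet below is a hedged illustration of that definition for any next-token model, not the authors' model, which operates on continuous audio representations.

import torch
import torch.nn.functional as F

def information_content(logits, targets):
    """Per-step surprisal IC(x_t) = -log p(x_t | x_<t), in nats.

    logits:  (T, V) autoregressive predictions for each step
    targets: (T,)   the events that actually occurred
    Peaks in the returned curve mark moments the model found surprising.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    return -log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)

# Toy example with a random "model" over a vocabulary of 16 audio tokens.
T, V = 32, 16
ic = information_content(torch.randn(T, V), torch.randint(0, V, (T,)))
print(ic.shape, float(ic.mean()))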
Pictured left to right: Stefan Lattner (Sony CSL - Paris), Alain Riou (Télécom-Paris/Sony CSL - Paris)
Another paper with contributions from Sony CSL - Paris, Sony AI, and Télécom-Paris explores musical stem retrieval, the task of helping musicians and producers locate individual audio stems that complement and enhance a track in progress. The authors present a method based on Joint-Embedding Predictive Architectures, in which an encoder and a predictor are trained jointly, with the predictor conditioned on arbitrary instruments. This enables zero-shot stem retrieval: the model can retrieve stems for instruments it was not trained on.
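At a high level, retrieval of this kind works by scoring how well each candidate stem's embedding matches a prediction made from the mix context and the desired instrument. The sketch below is only a schematic reading of that idea, with placeholder encoders, a concatenation-based predictor, and a simple cosine-similarity ranking; it is not the authors' architecture.

import torch
import torch.nn.functional as F

# Placeholder modules standing in for the trained encoder and conditional predictor.
encoder = torch.nn.Linear(128, 64)        # maps audio features -> embedding (assumption)
predictor = torch.nn.Linear(64 + 32, 64)  # (context embedding, instrument vector) -> predicted stem embedding

def retrieve_stem(context_audio, instrument_vec, candidate_stems):
    """Rank candidate stems by similarity to the embedding predicted for the missing part."""
    context_emb = encoder(context_audio)                              # (64,)
    predicted = predictor(torch.cat([context_emb, instrument_vec]))   # what the missing stem should "sound like"
    candidate_embs = encoder(candidate_stems)                         # (N, 64)
    scores = F.cosine_similarity(candidate_embs, predicted.unsqueeze(0), dim=-1)
    return scores.argsort(descending=True)                            # best-matching stems first

# Toy usage: 10 candidate stems, a random context track, and an instrument conditioning vector.
order = retrieve_stem(torch.randn(128), torch.randn(32), torch.randn(10, 128))
print(order[:3])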
Pictured: Stefan Lattner (Sony CSL - Paris)
Lattner and team also presented Music2Latent2, a novel audio compression encoder that achieves higher reconstruction quality. The encoder uses an autoregressive consistency model to handle arbitrary audio lengths and a two-step decoding procedure to refine generated audio.
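As a rough mental model only (the placeholder modules below are not the paper's architecture), autoregressive decoding of this kind reconstructs audio chunk by chunk, each chunk conditioned on what was decoded before, with a second pass refining the draft; any audio length can be handled by simply appending chunks.

import torch
import torch.nn as nn

chunk_len, latent_dim = 1024, 64
coarse = nn.Linear(latent_dim + chunk_len, chunk_len)   # (latent, previous chunk) -> draft chunk (placeholder)
refine = nn.Linear(chunk_len + latent_dim, chunk_len)   # (draft, latent) -> refined chunk (placeholder)

def decode(latents):
    """Decode latent summary vectors into audio chunks, one after another."""
    prev = torch.zeros(chunk_len)
    out = []
    for z in latents:
        draft = coarse(torch.cat([z, prev]))            # step 1: coarse generation from latent + context
        prev = refine(torch.cat([draft, z]))            # step 2: refinement conditioned on the same latent
        out.append(prev)
    return torch.cat(out)

print(decode(torch.randn(5, latent_dim)).shape)         # 5 latents -> 5 chunks of 1024 samples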
Pictured: Yunkee Chae (Sony AI/IPAI)
Researchers from Sony AI and Sony Group Corporation presented VRVQ (Variable Bitrate Residual Vector Quantization), a neural audio coding method that advances current state-of-the-art approaches by enabling more efficient, variable-bitrate coding and improving model training.
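Residual vector quantization, the building block behind this work, encodes each frame with a stack of codebooks, each quantizing the residual left by the previous one; a variable-bitrate coder can then choose how many codebooks to spend on each frame. The NumPy sketch below illustrates plain residual vector quantization with random codebooks and a fixed depth; the specific bit-allocation mechanism of VRVQ is described in the paper.

import numpy as np

rng = np.random.default_rng(0)
num_codebooks, codebook_size, dim = 4, 256, 16
codebooks = rng.normal(size=(num_codebooks, codebook_size, dim))   # placeholder "learned" codebooks

def rvq_encode(frame, depth):
    """Quantize one frame with `depth` codebooks; more depth means more bits and less error."""
    residual, codes = frame.copy(), []
    for level in range(depth):
        dists = np.linalg.norm(codebooks[level] - residual, axis=1)
        idx = int(dists.argmin())                    # nearest codeword to the current residual
        codes.append(idx)
        residual = residual - codebooks[level][idx]
    return codes, residual                           # transmitted indices and the remaining error

frame = rng.normal(size=dim)
for depth in (1, 4):                                 # a variable-bitrate coder would pick depth per frame
    codes, err = rvq_encode(frame, depth)
    print(depth, codes, round(float(np.linalg.norm(err)), 3))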
In addition to topics covered in conference lectures, Sony researchers presented novel solutions for pressing and practical challenges faced by researchers across the field of signal processing.
Pictured: Raj Gothi (Sony AI/Sony Research India)
In a Sony AI paper from Sony Research India, authors Kumud Tripathi, Raj Gothi, and Pankaj Wasnik explore novel approaches to enhancing multilingual speech recognition. While foundation models such as Whisper have driven advances in automatic speech recognition in recent years, these models still struggle with “low-resource languages,” that is, languages with limited data available for conversational AI systems to train on. In this paper, the authors improve performance through prompt-tuning and introduce a novel tokenizer to accelerate inference.
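Prompt-tuning adapts a frozen foundation model by learning a small set of "soft prompt" vectors that are prepended to the input, so only those vectors are updated for a new language. The sketch below shows the idea on a generic frozen encoder; it is a simplified illustration under assumed dimensions, not the paper's Whisper setup or its tokenizer changes.

import torch
import torch.nn as nn

# A stand-in for a frozen pretrained encoder (in the paper, this role is played by Whisper).
frozen_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True), num_layers=2)
for p in frozen_encoder.parameters():
    p.requires_grad = False                          # the foundation model stays untouched

num_prompt_tokens = 8
soft_prompt = nn.Parameter(torch.randn(num_prompt_tokens, 64) * 0.02)   # the only trainable weights

def encode_with_prompt(features):
    """Prepend the learned prompt vectors to the input sequence before the frozen encoder."""
    b = features.shape[0]
    prompt = soft_prompt.unsqueeze(0).expand(b, -1, -1)
    return frozen_encoder(torch.cat([prompt, features], dim=1))

# Toy usage: a batch of 2 utterances, 50 frames of 64-dim features each.
out = encode_with_prompt(torch.randn(2, 50, 64))
print(out.shape)                                     # (2, 58, 64): prompt tokens + original frames
opt = torch.optim.Adam([soft_prompt], lr=1e-3)       # only the soft prompt is optimized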
Researchers from Sony Group Corporation offer a solution for a different real-world scenario: people speaking over one another in meetings. In their paper, the authors present a novel attention-based encoder-decoder method augmented with speaker class tokens to untangle overlapped speech.
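One way to read the idea: the recognizer's target sequence is augmented so that each hypothesis carries a speaker class token, letting a single decoder keep overlapping speakers apart. The snippet below only illustrates how such target sequences might be assembled with hypothetical token IDs; the paper's full method, hypothesis clustering and merging, goes well beyond this.

# Hypothetical token inventory: ordinary subword IDs plus reserved speaker class tokens.
SPK_TOKENS = {"spk1": 1001, "spk2": 1002}
EOS = 1000

def build_targets(utterances):
    """Serialize overlapped speech into one decoder target: <speaker> token, words, <eos>, per speaker."""
    target = []
    for speaker, word_ids in utterances:
        target.append(SPK_TOKENS[speaker])   # tell the decoder whose words follow
        target.extend(word_ids)
        target.append(EOS)
    return target

# Toy example: two people talking over each other.
print(build_targets([("spk1", [12, 57, 8]), ("spk2", [91, 4])]))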
Additional publications from Sony Semiconductor Solutions, Sony Europe B.V., and Sony Research expand on diffusion model applications within the field of audio. Sony Europe B.V. presented a generative model called Hi-ResLDM, which offers a novel approach to high-resolution speech restoration. In a collaboration between Sony Europe B.V., Sony AI, and academic partners, researchers propose a method for music timbre transfer, a task which involves modifying the timbral characteristics of an audio signal while preserving its melodic structure.
Explore opportunities at Sony
Pictured: Shawn Barrett (SGC)
Sony team members enjoyed connecting in India with members of the signal processing community at ICASSP 2025. To learn more about career opportunities across Sony’s R&D teams, visit our Careers page.
Sony at ICASSP 2025: Accepted Papers
30+ Years of Source Separation Research: Achievements and Future Challenges (Sony AI)
Shoko Araki (NTT Corporation), Nobutaka Ito (University of Tokyo), Reinhold Haeb-Umbach (Paderborn University), Gordon Wichern (Mitsubishi Electric Research Laboratories), Zhong-Qiu Wang (Southern University of Science and Technology), Yuki Mitsufuji (Sony AI)
---
COCOLA: Coherence-Oriented Contrastive Learning of Musical Audio Representations (Sony Semiconductor Solutions)
Ruben Ciranni (Sapienza University of Rome), Giorgio Mariani (Sapienza University of Rome), Michele Mancusi (Sony Europe B.V.), Emilian Postolache (Ca' Foscari University of Venice), Giorgio Fabbro (Sony Europe B.V.), Emanuele Rodolà (Sapienza University of Rome), Luca Cosmo (Ca' Foscari University of Venice)
---
Enhancing Whisper's Accuracy and Speed for Indian Languages through Prompt-Tuning and Tokenization (Sony Research India)
Kumud Tripathi (Sony Research India), Raj Gothi (Sony Research India), Pankaj Wasnik (Sony Research India)
---
Estimating Musical Surprisal in Audio (CSL)
Mathias Rose Bjare (Johannes Kepler University), Giorgia Cantisani (PSL University), Stefan Lattner (Sony Computer Science Laboratories), Gerhard Widmer (Johannes Kepler University/Linz Institute of Technology)
---
Foundation Models Boost Low-Level Perceptual Similarity Metrics (SIE)
Abhijay Ghildyal (Portland State University), Nabajeet Barman (Sony Interactive Entertainment (PlayStation)), Saman Zadtootaghaj (Sony Interactive Entertainment (PlayStation))
---
High-Resolution Speech Restoration with Latent Diffusion Model (SSS)
Tushar Dhyani (Sony Europe B.V./University of Stuttgart), Florian Lux, Michele Mancusi (Sony Europe B.V.), Giorgio Fabbro (Sony Europe B.V.), Fritz Hohl (Sony Europe B.V.), Ngoc Thang Vu (University of Stuttgart)
---
Hybrid Losses for Hierarchical Embedding Learning (CSL)
Haokun Tian (Queen Mary University of London), Stefan Lattner (Sony Computer Science Laboratories), Brian McFee (New York University), Charalampos Saitis (Queen Mary University of London)
---
Hypothesis Clustering and Merging: Novel MultiTalker Speech Recognition with Speaker Tokens (SGC)
Yosuke Kashiwagi (Sony Group Corporation), Hayato Futami (Sony Group Corporation), Emiru Tsunoo (Sony Group Corporation), Siddhant Arora (Carnegie Mellon University), Shinji Watanabe (Carnegie Mellon University)
---
Latent Diffusion Bridges for Unsupervised Musical Audio Timbre Transfer (SSS, Sony Europe, Sony Research)
Michele Mancusi (Sony Europe B.V.), Yurii Halychanskyi (University of Illinois Urbana-Champaign), Kin Wai Cheuk (Sony AI), Eloi Moliner (Aalto University), Chieh-Hsin Lai (Sony AI), Stefan Uhlich (Sony Europe B.V.), Junghyun Koo (Sony AI), Marco A. Martínez-Ramírez (Sony AI), Wei-Hsiang Liao (Sony AI), Giorgio Fabbro (Sony Europe B.V.), Yuki Mitsufuji (Sony AI)
---
Music2Latent2: Audio Compression with Summary Embeddings and Autoregressive Decoding (CSL)
Marco Pasini (Queen Mary University of London), Stefan Lattner (Sony Computer Science Laboratories), George Fazekas (Queen Mary University of London)
---
Twenty-Five Years of MIR Research: Achievements, Practices, Evaluations, and Future Challenges (Sony AI)
Geoffroy Peeters (LTCI - Télécom Paris), Zafar Rafii (Audible Magic), Magdalena Fuentes (New York University), Zhiyao Duan (University of Rochester), Emmanouil Benetos (Queen Mary University of London), Juhan Nam (KAIST), Yuki Mitsufuji (Sony AI)
---
Variable Bitrate Residual Vector Quantization for Audio Coding (Sony AI)
Yunkee Chae (Sony AI/IPAI), Woosung Choi (Sony AI), Yuhta Takida (Sony AI), Junghyun Koo (Sony AI), Yukara Ikemiya (Sony AI), Zhi Zhong (Sony Group Corporation), Kin Wai Cheuk (Sony AI), Marco A. Martínez-Ramírez (Sony AI), Kyogu Lee (IPAI/AIIS/Seoul National University), Wei-Hsiang Liao (Sony AI), Yuki Mitsufuji (Sony AI/Sony Group Corporation)
---
Zero-shot Musical Stem Retrieval with Joint-Embedding Predictive Architectures (CSL)
Alain Riou (Télécom-Paris/Sony Computer Science Laboratories), Antonin Gagneré (Télécom-Paris), Gaëtan Hadjeres (Sony AI), Stefan Lattner (Sony Computer Science Laboratories), Geoffroy Peeters (Télécom-Paris)
