2021 IEEE International
Conference on Acoustics,
Speech and Signal Processing

The International Conference on Acoustics, Speech, and Signal Processing (ICASSP) is the IEEE Signal Processing Society's flagship conference on signal processing and its applications. The 46th edition of ICASSP will be held in the dynamic city of Toronto, Canada, one of the most multicultural and cosmopolitan cities in the world. The program will include keynotes by pre-eminent international speakers, cutting-edge tutorial topics, and forward-looking special sessions. ICASSP also provides a great networking opportunity with a wide range of like-minded professionals from academia, industry, and government organizations.

June 6-11, 2021
(ICASSP-2021 is a Virtual-only Conference)

Sony is pleased to be a Platinum sponsor of ICASSP 2021.

In light of the continuing COVID-19 pandemic, the International Conference on Acoustics, Speech and Signal Processing (ICASSP 2021) will again be held as a fully virtual conference.
We appreciate the work of the program committee and everyone involved in making this conference happen.

For many years, Sony has been conducting cutting-edge R&D in the fields of audio, acoustics, and AI. As a creative entertainment company, Sony seeks to use AI technology to unleash the potential of human creativity. Sony is venturing into various fields and actively seeking applications of these technologies so that they can benefit the whole of society.

As one of the sponsors of ICASSP 2021, Sony will give dedicated workshops and exhibit some of our latest research activities and business use cases on this site. Please join us for these workshops - we are looking forward to seeing you there.

We sincerely wish all participants a great conference this year and we are looking forward to seeing you at the next ICASSP.


Technology 01
Content Restoration/
Protection/Generation Technologies

The first work is titled "D3Net: Densely connected multidilated DenseNet for music source separation". Dense prediction tasks, such as semantic segmentation and audio source separation, involve high-resolution inputs and predictions. To efficiently model both local and global structure in the data, we propose dense multiresolution learning that combines a dense skip-connection topology with a novel multidilated convolution. The proposed D3Net achieves state-of-the-art performance on a music source separation task, where the goal is to separate individual instrumental sounds from a music recording. A demo is available in another video, so please check it out!
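The multidilated idea can be sketched in a few lines of numpy: split the channels into groups, give each group its own dilation, and sum the results so a single layer sees several resolutions at once. This is only a 1-D toy illustration of the concept; the actual D3Net uses 2-D multidilated convolutions inside densely connected blocks, and the function names here are ours.

```python
import numpy as np

def dilated_conv1d(x, kernel, dilation):
    """1-D dilated convolution with zero padding ('same' output length)."""
    k = len(kernel)
    pad = dilation * (k - 1) // 2
    xp = np.pad(x, pad)
    return np.array([
        sum(kernel[j] * xp[i + j * dilation] for j in range(k))
        for i in range(len(x))
    ])

def multidilated_conv1d(x_groups, kernel):
    """Apply a different dilation (1, 2, 4, ...) to each channel group
    and sum, so one layer captures multiple resolutions simultaneously."""
    return sum(dilated_conv1d(g, kernel, 2 ** i) for i, g in enumerate(x_groups))
```

With an identity kernel each group passes through unchanged, which makes the summing behavior easy to verify by hand.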

In the second paper, we investigate adversarial attacks on the audio source separation problem. We found that it is possible to severely degrade separation performance by adding imperceptible noise to the input mixture under a white-box condition, while under a black-box condition, source separation methods exhibit a certain level of robustness. We believe this work is important for understanding the robustness of source separation models, as well as for developing content protection methods against the abuse of separated signals.

The last paper investigates posterior collapse in the Gaussian VAE. The VAE is a popular generative model thanks to its tractability and training stability, but it suffers from the problem of posterior collapse. We investigated the cause of posterior collapse in the Gaussian VAE from the viewpoint of local Lipschitz smoothness. We then proposed a modified ELBO-based objective function that adapts the hyperparameter in the ELBO automatically. Our new objective function prevents over-smoothing of the decoder, i.e., posterior collapse.
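As a rough illustration of the idea, here is a minimal numpy sketch of the Gaussian ELBO with an MLE-style update of the decoder variance, which plays the role of the ELBO hyperparameter. The paper's actual objective and adaptation rule differ; the variance floor below is our own illustrative choice.

```python
import numpy as np

def gaussian_elbo(x, x_hat, mu, logvar, sigma2):
    """Negative ELBO for a Gaussian decoder with (scalar) variance sigma2."""
    d = x.size
    rec = 0.5 * (np.sum((x - x_hat) ** 2) / sigma2 + d * np.log(2 * np.pi * sigma2))
    kl = 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar)
    return rec + kl

def adaptive_sigma2(x, x_hat, floor=1e-4):
    """MLE update of the decoder variance from the current reconstruction
    error; keeping it bounded away from zero limits decoder over-smoothing."""
    return max(np.mean((x - x_hat) ** 2), floor)
```

With a perfect reconstruction the KL term alone drives the latent code, which is exactly the regime where a fixed, badly chosen variance can cause collapse.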

  • Naoya Takahashi

    Sony R&D Center Tokyo, Japan

He received his Ph.D. from the University of Tsukuba, Japan, in 2020. From 2015 to 2016, he worked at the Computer Vision Lab and Speech Processing Group at ETH Zurich as a visiting researcher. Since joining Sony Corporation in 2008, he has conducted research in the audio, computer vision, and machine learning domains. In 2018, he won the Sony Outstanding Engineer Award, the highest form of individual recognition for Sony Group engineers. His current research interests include audio source separation, semantic segmentation, video highlight detection, event detection, speech recognition, and music technologies.

Technology 02
Video Colorization for Content Revitalization

Sony is highly interested in revitalizing old content, and video colorization is one such effort. However, video colorization is a challenging task with temporal-coherence and user-controllability issues. We introduce our unique reference-based video colorization and demonstrate that it can alleviate the issues described above, helping revitalize old black-and-white videos.

  • Andrew Shin

    Sony R&D Center Tokyo, Japan

He received his Ph.D. from The University of Tokyo in 2017, after which he joined Sony. He has been working on the development of Sony's deep learning framework, Neural Network Libraries, as well as developing machine learning tools to support content creation for the entertainment business and conducting core research.

  • Naofumi Akimoto

    Sony R&D Center Tokyo, Japan

He received his master's degree in engineering from Keio University in 2020, after which he joined Sony. He has been working on research into machine learning technologies for image and video enhancement for Sony's entertainment business.

Technology 03
Mixed Precision Quantization of Deep Neural Networks

In order to deploy AI applications to edge devices, it is essential to reduce the footprint of DNNs. Sony R&D Center is researching and developing methods and tools for this purpose. In particular, we have been investigating new approaches for training quantized DNNs, which reduce the memory, computation, and energy footprints. For example, our paper presented at the ICLR 2020 conference, entitled "Mixed Precision DNNs: All you need is a good parametrization", shows how we can train DNNs to optimally distribute the bitwidths across layers given a specific memory budget. The resulting mixed-precision MobileNetV2 reduces the required memory by nearly 10x without significant loss of accuracy on the ImageNet classification benchmark.
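To make the idea concrete, here is a toy numpy sketch of a uniform symmetric quantizer and of the weight-memory accounting implied by a per-layer bitwidth assignment. The paper itself learns the quantizer parametrization (step size, dynamic range, bitwidth) jointly with the weights via straight-through gradients; the names and defaults below are ours, not the paper's.

```python
import numpy as np

def quantize_uniform(w, step, bits):
    """Uniform symmetric quantizer: round weights to the nearest step and
    clip to the range representable with the given signed bitwidth."""
    qmax = 2 ** (bits - 1) - 1
    return np.clip(np.round(w / step), -qmax - 1, qmax) * step

def memory_bits(shapes_bits):
    """Total weight memory (in bits) for a per-layer bitwidth assignment,
    given as a list of (shape, bits) pairs."""
    return sum(np.prod(shape) * bits for shape, bits in shapes_bits)
```

Distributing bitwidths then becomes a constrained problem: minimize task loss subject to `memory_bits(...)` staying under the budget.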

  • Fabien Cardinaux

    Sony R&D Center Europe, Germany

Dr. Fabien Cardinaux leads an R&D team at the Sony R&D Center Europe in Stuttgart, Germany. Prior to joining Sony in 2011, he worked as a postdoc at the University of Sheffield (UK). In 2005, he obtained a PhD from EPFL (Switzerland) for his work on machine learning methods applied to face authentication. His current research interests lie in deep neural network footprint reduction, neural architecture search, and audio content creation. Fabien contributes to academic research by regularly publishing at and reviewing for major machine learning conferences.

Technology 04
Highly Efficient Realtime Visual Sensing Applications with Event-based Sensors

Inspired by the human eye, event-based vision sensors (EVS) are an emerging technology that addresses increasingly challenging scenarios faced by today's conventional cameras. They react quickly to sudden changes, handle extreme lighting conditions, and operate with very little power. We have drastically reduced the EVS pixel size and have now reached a point where we can combine and integrate it with state-of-the-art image sensors to get the best of both worlds. We are developing software solutions to facilitate the adoption of EVS technology in various markets. The video on this page introduces some of our current work, including high-speed tracking, high-speed depth sensing, visible light communication, and efficient event-based processing.
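As a minimal illustration of what event-based data looks like, the sketch below accumulates a stream of (x, y, t, polarity) events into a signed frame, a common preprocessing step before feeding events to downstream vision models. This is our own toy example, not Sony's processing pipeline.

```python
import numpy as np

def events_to_frame(events, height, width):
    """Accumulate (x, y, t, polarity) events into a signed 2-D frame:
    +1 events brighten a pixel, -1 events darken it."""
    frame = np.zeros((height, width))
    for x, y, _t, p in events:
        frame[y, x] += p
    return frame
```

Because only changing pixels emit events, such frames are typically sparse, which is what enables the low power and high speed of EVS processing.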

Technology 05
AI Tools for Music Composition – A collaboration with Uèle Lamore

"Heqet's Shadow: Return of Glycon" is a collaborative EP between French-American composer Uèle Lamore and the "Music and Artificial Intelligence team" of Sony Computer Science Laboratories (Sony CSL), a lab that develops innovative music production software using state-of-the-art Artificial Intelligence algorithms. This new generation of music production tools redefines the artist/technology relationship by putting Artificial Intelligence at the service of the artist, acting as real studio companions – tools with a personality, essentially. Here, the idea was to showcase a series of prototypes that further develop Sony CSL's artist-centric vision.

"It was such a great experience working with the Sony CSL researchers, very refreshing and exciting," she says. "For a producer, helping to develop a pioneering new kind of music tool can only give you a sense of immense pride."

In homage to Sony's rich history and involvement in video games, Heqet's Shadow: Return of Glycon is the soundtrack of the imaginary video game of the same title. Lamore started with pure experimentation, just intuitively using the tools in a controlled environment. The goal was familiarization and creating a bank of percussion and drum samples to be deployed throughout. "I wanted sounds that were very abstract and in the family of toys or small orchestra percussion," she says; and DrumGAN[1] gave her precisely that.

The next step was doing the same with NOTONO[2], creating what she terms a "character" sound for the track 'Corruption Of The Toad Forest'. The tool's extreme treatment of sound resulted in very phased, filtered samples with a peculiar acoustic quality, perfect to represent "corruption". With this in place, Lamore set about building a sonic narrative over the EP's four tracks, employing a number of Sony CSL's tools, namely DrumNet[3], BassNet[4], and Flow Machines[5], to create a rich, vibrant palette and vintage quality.

"I was able to really do exactly what I wanted, from beginning to finish. I had their total and complete trust, it was such a positive work environment. What was so great about this, is that the guys on the Sony CSL team are real music lovers. I became good friends with them as we spoke the same language: we all wanted to just experiment, make good music, push the tools forward to serve the music, help each other during the process and have fun," she says of the project.

[1] Nistal et al., DrumGAN: Synthesis of Drum Sounds With Timbral Feature Conditioning Using Generative Adversarial Networks, ISMIR 2020
- Drum sounds synthesizer
- GAN-based model, conditioned on perceptual features

[2] Bazin et al., Spectrogram Inpainting for Interactive Generation of Instrument Sounds,
2020 Joint Conference on AI Music Creativity
- Interactive tool for generating instrumental one-shots
- VQ-VAE operating on spectrograms, conditioned on instrument labels
- Start from a sound you like and iteratively modify it through inpainting

[3] Lattner et al., DrumNet: High-Level Control of Drum Track Generation using Learned Patterns of Rhythmic Interaction, WASPAA 2020
- Interactive tool for generating drum/bass patterns for existing audio.
- Convolutional Gated Autoencoder representing variable-length, dynamically changing rhythmic/melodic patterns as stable points in its latent space.

[4] Grachten et al., BassNet: A Variational Gated Autoencoder for Conditional Generation of Bass Guitar Tracks with Learned Interactive Control
- Tempo-invariant representations: the output adapts to the tempo of the audio input automatically.
- Explore latent spaces to quickly generate drum/bass accompaniments for full-length songs.

[5] Flow Machines
- Composition assistant (melody, chords)
- Markov chain-based model

Business use case

Case 01
Sony's World's First Intelligent Vision Sensors with AI Processing Functionality Enabling High-Speed Edge AI Processing and Contributing to Building of Optimal Systems Linked with the Cloud

Sony Corporation announced two models of intelligent vision sensors, the first image sensors in the world to be equipped with AI processing functionality. Including AI processing functionality on the image sensor itself enables high-speed edge AI processing and extraction of only the necessary data, which, when using cloud services, reduces data transmission latency, addresses privacy concerns, and reduces power consumption and communication costs.

Fig. Intelligent Vision Sensor

Case 02
360 Reality Audio

360 Reality Audio is a new music experience that uses Sony's object-based 360 Spatial Sound technologies.
Individual sounds such as vocals, chorus, piano, guitar, bass and even sounds of the live audience can be placed in a 360 spherical sound field, giving artists and creators a new way to express their creativity. Listeners can be immersed in a field of sound exactly as intended by artists and creators.
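As a small illustration of object placement in a spherical sound field, the sketch below converts an object's azimuth and elevation into a Cartesian direction vector, the first step of any object-based renderer. This is our own toy example, not the 360 Reality Audio renderer itself.

```python
import numpy as np

def object_direction(azimuth_deg, elevation_deg):
    """Place a sound object on the unit sphere: convert azimuth and
    elevation (degrees) into a Cartesian direction vector
    (x = front, y = left, z = up)."""
    az, el = np.radians(azimuth_deg), np.radians(elevation_deg)
    return np.array([np.cos(el) * np.cos(az),
                     np.cos(el) * np.sin(az),
                     np.sin(el)])
```

A renderer then maps each object's direction vector to speaker or binaural gains, which is how vocals, instruments, and audience sounds each get their own position in the sphere.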

360 Spatial Sound Personalization for more immersive 360 Reality Audio experience
Optimization using personal ear data relies on Sony's original estimation algorithm utilizing machine learning.
We analyze the listener's hearing characteristics by estimating the 3D shape of the ear from a photo of the ear taken with the "Sony | Headphones Connect" app.

Case 03
Speak to Chat

"Speak-to-Chat" UX Overview:
Speak-to-Chat allows you to speak to others while wearing your headphones. When you start speaking, AI detects your speech and automatically switches to a mode suited for conversation.

"Speak-to-Chat" Algorithm:
"Speak-to-Chat" is realized by Sony's unique technologies.
Precise Voice Pickup Technology captures the user's voice even in noisy situations.
The speech detection algorithm is created using machine learning technology.
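For illustration only, a toy energy-threshold voice activity detector is sketched below. Sony's actual speech detection algorithm is built with machine learning; this simple RMS-level rule merely shows the kind of frame-wise speech/no-speech decision such a detector produces.

```python
import numpy as np

def detect_speech(frames, threshold_db=-40.0):
    """Toy energy-based voice activity detector: a frame counts as speech
    when its RMS level (in dB relative to full scale) exceeds the threshold.
    frames: array of shape (n_frames, samples_per_frame)."""
    rms = np.sqrt(np.mean(frames ** 2, axis=-1))
    return 20 * np.log10(np.maximum(rms, 1e-10)) > threshold_db
```

In a product, a learned detector replaces the fixed threshold so that music, noise, and other people's voices do not trigger the conversation mode.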


Publication 01
Adversarial Attacks on Audio Source Separation

Naoya Takahashi, Shota Inoue*, Yuki Mitsufuji
*University of Tsukuba


Despite the excellent performance of neural-network-based audio source separation methods and their wide range of applications, their robustness against intentional attacks has been largely neglected. In this work, we reformulate various adversarial attack methods for the audio source separation problem and intensively investigate them under different attack conditions and target models. We further propose a simple yet effective regularization method to obtain imperceptible adversarial noise while maximizing the impact on separation quality with low computational complexity. Experimental results show that it is possible to largely degrade the separation quality by adding imperceptibly small noise when the noise is crafted for the target model. We also show the robustness of source separation models against a black-box attack. This study provides potentially useful insights for developing content protection methods against the abuse of separated signals and improving the separation performance and robustness.
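The generic attack step underlying such methods can be sketched as follows: a single FGSM-style perturbation of the input mixture under an L-infinity budget, with the loss gradient assumed to be given (the white-box setting). The paper's reformulated attacks and imperceptibility regularizer are more elaborate; the function names here are ours.

```python
import numpy as np

def fgsm_attack(mixture, grad, eps=1e-3):
    """One FGSM step: perturb the mixture in the direction that increases
    the separation loss, under an L-infinity budget eps."""
    return mixture + eps * np.sign(grad)

def snr_db(clean, noisy):
    """Signal-to-noise ratio of the added perturbation, in dB; high values
    mean the perturbation is quiet relative to the signal."""
    noise = noisy - clean
    return 10 * np.log10(np.sum(clean ** 2) / np.sum(noise ** 2))
```

The key empirical finding is that even perturbations with a very high SNR (i.e., inaudible) can wreck separation quality when crafted for the target model.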

Publication 02
All for One and One for All: Improving Music Separation by Bridging Networks

Ryosuke Sawata, Stefan Uhlich, Shusuke Takahashi, Yuki Mitsufuji


This paper proposes several improvements for music separation with deep neural networks (DNNs), namely a multi-domain loss (MDL) and two combination schemes. First, by using MDL we take advantage of the frequency and time domain representation of audio signals. Next, we utilize the relationship among instruments by jointly considering them. We do this on the one hand by modifying the network architecture and introducing a CrossNet structure. On the other hand, we consider combinations of instrument estimates by using a new combination loss (CL). MDL and CL can easily be applied to many existing DNN-based separation methods as they are merely loss functions which are only used during training and which do not affect the inference step. Experimental results show that the performance of Open-Unmix (UMX), a well-known and state-of-the-art open source library for music separation, can be improved by utilizing our above schemes. Our modifications of UMX are open-sourced together with this paper (PyTorch/NNabla).
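A minimal numpy sketch of a multi-domain loss is shown below: a weighted sum of a time-domain MSE and a magnitude-spectrogram MSE. The actual MDL in the paper uses a proper STFT with overlapping windows; this toy uses non-overlapping frames, and the weighting parameter name is ours.

```python
import numpy as np

def multi_domain_loss(est, ref, alpha=0.5, n_fft=64):
    """Weighted sum of a time-domain MSE and a magnitude-spectrogram MSE,
    so the separator is supervised in both domains at once."""
    time_loss = np.mean((est - ref) ** 2)
    # Toy 'STFT': non-overlapping frames of length n_fft, magnitude only.
    spec = lambda s: np.abs(np.fft.rfft(s.reshape(-1, n_fft), axis=-1))
    freq_loss = np.mean((spec(est) - spec(ref)) ** 2)
    return alpha * time_loss + (1 - alpha) * freq_loss
```

Because it is purely a training-time loss, a term like this can be dropped into an existing separator such as UMX without changing inference at all, which is the point the paper makes.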

Publication 03
End-to-End Lyrics Recognition with Voice to Singing Style Transfer

Sakya Basak*, Shrutina Agarwal*, Sriram Ganapathy*, Naoya Takahashi
*Indian Institute of Science


Automatic transcription of monophonic/polyphonic music is a challenging task due to the lack of large amounts of transcribed data. In this paper, we propose a data augmentation method that converts natural speech to singing voice using a vocoder-based speech synthesizer. This approach, called voice to singing (V2S), performs the voice style conversion by modulating the F0 contour of the natural speech with that of a singing voice. The V2S model-based style transfer can generate good-quality singing voice, thereby enabling the conversion of large corpora of natural speech to singing voice, which is useful in building an E2E lyrics transcription system. In our experiments on monophonic singing voice data, the V2S style transfer provides a significant gain (a relative improvement of 21%) for the E2E lyrics transcription system. We also discuss additional components like transfer learning and lyrics-based language modeling to improve the performance of the lyrics transcription system.
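The core contour-modulation step of V2S can be sketched as follows: resample a singing F0 contour to the speech frame count while keeping unvoiced speech frames unvoiced. The actual method resynthesizes audio with a vocoder-based synthesizer; this numpy toy shows only the F0 replacement, and the function name is ours.

```python
import numpy as np

def transfer_f0(speech_f0, singing_f0):
    """Replace the F0 contour of speech with a singing contour, resampled
    to the speech frame count; unvoiced frames (F0 == 0) stay unvoiced."""
    n = len(speech_f0)
    idx = np.linspace(0, len(singing_f0) - 1, n)
    resampled = np.interp(idx, np.arange(len(singing_f0)), singing_f0)
    return np.where(np.asarray(speech_f0) > 0, resampled, 0.0)
```

Feeding the modified F0 (with the original spectral envelope) to a vocoder yields speech that "sings" the melody, turning plain speech corpora into lyrics-transcription training data.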

Publication 04
ACCDOA: Activity-Coupled Cartesian Direction of Arrival Representation for Sound Event Localization and Detection

Kazuki Shimada, Yuichiro Koyama, Naoya Takahashi, Shusuke Takahashi, Yuki Mitsufuji


Neural-network (NN)-based methods show high performance in sound event localization and detection (SELD). Conventional NN-based methods use two branches for a sound event detection (SED) target and a direction-of-arrival (DOA) target. The two-branch representation with a single network has to decide how to balance the two objectives during optimization. Using two networks dedicated to each task increases system complexity and network size. To address these problems, we propose an activity-coupled Cartesian DOA (ACCDOA) representation, which assigns a sound event activity to the length of a corresponding Cartesian DOA vector. The ACCDOA representation enables us to solve a SELD task with a single target and has two advantages: avoiding the necessity of balancing the objectives and model size increase. In experimental evaluations with the DCASE 2020 Task 3 dataset, the ACCDOA representation outperformed the two-branch representation in SELD metrics with a smaller network size. The ACCDOA-based SELD system also performed better than state-of-the-art SELD systems in terms of localization and location-dependent detection.
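The ACCDOA encoding and its decoding rule can be sketched directly from the definition: the event activity scales the length of a unit DOA vector, and at inference a class is declared active when its vector norm exceeds a threshold. The threshold value below is illustrative, not the paper's.

```python
import numpy as np

def accdoa_encode(activity, doa_unit):
    """Scale each unit DOA vector by its event activity (0 or 1), merging
    the SED and DOA targets into one Cartesian vector per event class."""
    return activity[:, None] * doa_unit

def accdoa_decode(vec, threshold=0.5):
    """A class is active if its vector length exceeds the threshold; the
    normalized vector then gives the direction of arrival."""
    norm = np.linalg.norm(vec, axis=-1)
    active = norm > threshold
    doa = np.where(active[:, None], vec / np.maximum(norm, 1e-9)[:, None], 0.0)
    return active, doa
```

Because activity and direction share one target, a single-branch network suffices, which is where the smaller model size and the absence of loss balancing come from.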

Publication 05
Gaussian Kernelized Self-Attention for Long Sequence Data and its Application to CTC-Based Speech Recognition

Yosuke Kashiwagi, Emiru Tsunoo, Shinji Watanabe*
*Johns Hopkins University


Self-attention (SA) based models have recently achieved significant performance improvements in hybrid and end-to-end automatic speech recognition (ASR) systems owing to their flexible context modeling capability. However, it is also known that the accuracy degrades when applying SA to long sequence data. This is mainly due to the length mismatch between the inference and training data because the training data are usually divided into short segments for efficient training. To mitigate this mismatch, we propose a new architecture, which is a variant of the Gaussian kernel, which itself is a shift-invariant kernel. First, we mathematically demonstrate that self-attention with shared weight parameters for queries and keys is equivalent to a normalized kernel function. By replacing this kernel function with the proposed Gaussian kernel, the architecture becomes completely shift-invariant with the relative position information embedded using a frame indexing technique. The proposed Gaussian kernelized SA was applied to connectionist temporal classification (CTC) based ASR. An experimental evaluation with the Corpus of Spontaneous Japanese (CSJ) and TEDLIUM 3 benchmarks shows that the proposed SA achieves a significant improvement in accuracy (e.g., from 24.0% WER to 6.0% in CSJ) in long sequence data without any windowing techniques.
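The row-normalized Gaussian kernel at the heart of the proposed attention can be sketched in numpy as follows. In the paper, relative position is embedded via a frame-indexing technique; in this toy that amounts to appending a frame index to each feature vector, which is left to the caller.

```python
import numpy as np

def gaussian_attention(q, sigma=1.0):
    """Attention weights from a Gaussian kernel on the shared query/key
    features: A[i, j] proportional to exp(-||q_i - q_j||^2 / (2 sigma^2)),
    normalized so each row sums to one."""
    d2 = np.sum((q[:, None, :] - q[None, :, :]) ** 2, axis=-1)
    logits = -d2 / (2 * sigma ** 2)
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return w / w.sum(axis=-1, keepdims=True)
```

Because the kernel depends only on differences between feature vectors, the weights are shift-invariant, which is what keeps the behavior stable on sequences far longer than the training segments.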

Publication 06
Making Punctuation Restoration Robust and Fast with Multi-Task Learning and Knowledge Distillation

Michael Hentschel, Emiru Tsunoo, Takao Okuda


In punctuation restoration, we try to recover the missing punctuation from automatic speech recognition output to improve understandability. Currently, large pre-trained transformers such as BERT set the benchmark on this task but there are two main drawbacks to these models. First, the pre-training data does not match the output data from speech recognition that contains errors. Second, the large number of model parameters increases inference time. To address the former, we use a multi-task learning framework with ELECTRA, a recently proposed improvement on BERT, that has a generator-discriminator structure. The generator allows us to inject errors into the training data and, as our experiments show, this improves robustness against speech recognition errors during inference. To address the latter, we investigate knowledge distillation and parameter pruning of ELECTRA. In our experiments on the IWSLT 2012 benchmark data, a model with less than 11% the size of BERT achieved better performance while having an 82% faster inference time.
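The error-injection idea can be illustrated with a toy function that corrupts clean training text with random substitutions and deletions, mimicking ASR mistakes. In the paper, ELECTRA's generator produces contextually plausible replacements rather than random ones; the probabilities and names here are ours.

```python
import random

def inject_asr_errors(tokens, vocab, p_sub=0.1, p_del=0.05, seed=0):
    """Simulate ASR errors in clean training text: randomly substitute a
    token with another vocabulary word or delete it, so the punctuation
    model learns to tolerate recognition mistakes at inference time."""
    rng = random.Random(seed)
    out = []
    for tok in tokens:
        r = rng.random()
        if r < p_del:
            continue                       # deletion error
        if r < p_del + p_sub:
            out.append(rng.choice(vocab))  # substitution error
        else:
            out.append(tok)
    return out
```

Training on text corrupted this way narrows the gap between clean pre-training data and noisy speech recognition output.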

Contributions to Research Challenges

MDX Challenge

The Music Demixing (MDX) Challenge continues the successful series of music separation tasks of the Signal Separation Evaluation Campaign (SiSEC) and brings it to the next level. We are delighted to be a sponsor and part of the organizing team for this exciting challenge, which will be held as a satellite event of ISMIR 2021. In the challenge, our novel music source separation method X-UMX, presented at this year's ICASSP, is one of the baseline systems that participants can use and improve. The implementation of X-UMX can be found here (PyTorch/NNabla). Please find more detailed information about the MDX Challenge here.

DCASE2021 Challenge

DCASE 2021 Task 3 focuses on sound event localization and detection. In this task, our Activity-Coupled Cartesian Direction of Arrival (ACCDOA) representation serves as the basis of the baseline system. The implementation is publicly available in the git repository.

Recruit information for ICASSP-2021

We look forward to highly motivated individuals applying to Sony so that we can work together to fill the world with emotion and pioneer the future with dreams and curiosity. Join us and be part of a diverse, innovative, creative, and original team to inspire the world. If you are interested in working with us, please click here for open job and internship positions.
