Cutting Edge

Dec 19, 2025

NeurIPS 2025 Review: Research Highlights from Sony Group

This year at NeurIPS 2025, Sony Group contributed across multiple parts of the program, sharing research that spans adaptation methods, dynamic scene reconstruction, evaluation in game environments, and music-related generative modeling. As one of the leading conferences in machine learning, NeurIPS offers a broad view of current research and emerging technical directions.

In this post, we highlight a subset of these contributions, with a focus on methods that help AI systems operate reliably in complex settings.

Main Track:

Spotlight
StelLA: Subspace Learning in Low-rank Adaptation using Stiefel Manifold

Authors: Zhizhong Li (Sony AI), Sina Sajadmanesh (Sony AI), Jingtao Li (Sony AI), Lingjuan Lyu (Sony AI)

StelLA studies how low-rank adapters behave during fine-tuning. Conventional LoRA updates can drift, which reduces stability and makes the adapted model harder to control. As the authors note, “low-rank adaptation still lags behind full fine-tuning in performance, partly due to its insufficient exploitation of the geometric structure underlying low-rank manifolds.”

StelLA introduces a Stiefel manifold constraint that keeps the update directions orthonormal. This constraint helps the model maintain a consistent subspace throughout adaptation. The authors show that this structure leads to steadier fine-tuning, especially in low-data or low-rank settings. The approach offers a clearer way to guide LoRA updates without adding complexity to the training workflow.
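
For intuition, the general recipe can be sketched in a few lines of PyTorch: a LoRA-style layer whose factor matrices are pulled back onto the Stiefel manifold with a QR retraction after each optimizer step. The layer, initialization, and retraction below are illustrative assumptions, not the paper’s exact parameterization or training procedure.

```python
import torch

def qr_retraction(m: torch.Tensor) -> torch.Tensor:
    # Map a matrix back onto the Stiefel manifold (orthonormal columns)
    # via QR decomposition; a standard retraction choice, assumed here.
    q, r = torch.linalg.qr(m)
    # Fix column signs so the factorization is unique.
    return q * torch.sign(torch.diagonal(r))

class StiefelLoRALinear(torch.nn.Module):
    """Hypothetical LoRA-style layer whose factors U and V stay orthonormal."""
    def __init__(self, base: torch.nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base  # frozen pretrained layer
        d_out, d_in = base.weight.shape
        self.U = torch.nn.Parameter(qr_retraction(torch.randn(d_out, rank)))
        self.V = torch.nn.Parameter(qr_retraction(torch.randn(d_in, rank)))
        self.s = torch.nn.Parameter(torch.zeros(rank))  # per-direction scales

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        delta = (self.U * self.s) @ self.V.T  # low-rank update U diag(s) V^T
        return self.base(x) + x @ delta.T

    @torch.no_grad()
    def retract(self):
        # Call after each optimizer step to return U and V to the manifold.
        self.U.copy_(qr_retraction(self.U))
        self.V.copy_(qr_retraction(self.V))
```

Calling retract() after every optimizer step keeps U and V orthonormal, which is the property the Stiefel constraint is meant to preserve during adaptation.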

Sina Sajadmanesh (Sony AI), posing alongside the StelLA: Subspace Learning in Low-rank Adaptation using Stiefel Manifold poster

Enhancing 3D Reconstruction for Dynamic Scenes
Authors: Jisang Han, Honggyu An, Jaewoo Jung, Takuya Narihira (Sony AI), Junyoung Seo, Kazumi Fukuda (Sony AI), Chaehyun Kim, Sunghwan Hong, Yuki Mitsufuji (Sony AI/Sony Group Corporation), Seungryong Kim

While StelLA focuses on stability during adaptation, this paper examines stability in a different form—how to maintain coherent geometry when scenes contain motion.

Many reconstruction methods assume static scenes, which leads to depth errors and misalignment when objects move. This paper introduces Static–Dynamic Aligned Pointmaps (SDAP), a representation designed to keep both static and moving regions consistent over time. The method uses optical flow, occlusion masks, and region-specific losses so each part of the scene is aligned according to its motion.
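
As a rough illustration of region-specific supervision (not the paper’s actual losses), one can weight a per-pixel pointmap error separately for static and dynamic regions while masking out pixels whose optical-flow correspondence is occluded:

```python
import torch

def region_weighted_pointmap_loss(pred, target, motion_mask, occlusion_mask,
                                  static_weight=1.0, dynamic_weight=1.0):
    """Illustrative sketch: supervise static and moving regions separately.
    pred, target: (H, W, 3) pointmaps; masks: (H, W) with values in {0, 1}.
    motion_mask is 1 on moving objects; occlusion_mask is 1 where the
    optical-flow correspondence is unreliable and should be ignored."""
    valid = 1.0 - occlusion_mask
    err = (pred - target).norm(dim=-1)               # per-pixel 3D error
    static_loss = (err * (1.0 - motion_mask) * valid).mean()
    dynamic_loss = (err * motion_mask * valid).mean()
    return static_weight * static_loss + dynamic_weight * dynamic_loss
```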

The authors report clearer geometry and more stable pointmaps across datasets such as TUM-Dynamics, Bonn, and Sintel. Improvements are most evident in sequences with human or object movement, where standard pipelines tend to distort or lose structure. The work provides a more reliable approach for reconstruction tasks where motion is expected.

The paper also reports measurable gains: “experimental results and visual comparisons confirm that our approach delivers a significant boost in reconstruction accuracy.” The authors further conclude that their method “significantly enhances the quality of 3D reconstruction in dynamic environments achieving new state-of-the-art performance” across depth estimation, camera pose estimation, and 3D point alignment.

From left to right: Chaehyun Kim (KAIST AI), Honggyu An (KAIST AI), Jisang Han (KAIST AI), Kazumi Fukuda (Sony AI), posing alongside the Enhancing 3D Reconstruction for Dynamic Scenes poster

Datasets & Benchmarks Track:

VideoGameQA-Bench: Evaluating Vision-Language Models for Video Game Quality Assurance
Authors: Mohammad Reza Taesiri, Abhijay Ghildyal (Sony Interactive Entertainment), Saman Zadtootaghaj (Sony Interactive Entertainment), Nabajeet Barman (Sony Interactive Entertainment), Cor-Paul Bezemer

The video game industry, now a $257 billion market, faces a bottleneck in Quality Assurance (QA), which currently relies heavily on manual visual inspection to ensure bug-free experiences. Recent advances in Vision-Language Models (VLMs) offer considerable potential to automate and enhance many aspects of game development, QA above all, given the time and cost of testing an ever-increasing number of visually complex titles.

Accurately evaluating how well VLMs handle real-world video game QA tasks requires standardized benchmarks, yet existing benchmarks fall short of this domain’s requirements: they tend to focus heavily on complex mathematical or textual reasoning while overlooking the visual comprehension tasks fundamental to video game QA. To address this, the authors introduce VideoGameQA-Bench, a benchmark designed to evaluate VLMs on real video game QA tasks rather than abstract reasoning puzzles. The dataset spans 4,786 questions across nine tasks, covering visual and UI unit testing, visual regression testing, glitch detection, “needle-in-a-haystack” glitch search in long videos, and automatic bug report generation for both images and videos drawn from 800+ games and controlled Unity scenes.
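
A benchmark of this shape is typically consumed through a simple evaluation harness. The item layout and model interface below are hypothetical, shown only to make the setup concrete; the benchmark’s actual schema may differ.

```python
# Hypothetical item layout and scoring loop for a VideoGameQA-style
# benchmark; the real schema and task list may differ.
example_item = {
    "task": "glitch_detection",
    "media": "frames/sample_0042.png",   # illustrative path
    "question": "Does this frame contain a visual glitch? Answer yes or no.",
    "answer": "yes",
}

def evaluate(model, items):
    """Exact-match accuracy over yes/no items; `model.ask` is an assumed
    interface that sends an image plus a question to a VLM."""
    correct = 0
    for item in items:
        pred = model.ask(item["media"], item["question"])
        correct += int(pred.strip().lower() == item["answer"])
    return correct / len(items)
```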

The authors evaluated 16 state-of-the-art proprietary and open-weight VLMs on VideoGameQA-Bench and found a mixed picture. On the positive side, frontier models can already detect many visual glitches in images (up to ~83% accuracy) and videos (up to ~78%), and generate useful bug report descriptions for around half of real-world glitches, suggesting clear potential for tools that help QA teams triage and document issues faster. But the benchmark also exposes hard limits: models consistently struggle with fine-grained visual details (like exact body poses or small UI elements), with robust visual regression testing (best accuracy ~45%), and with precisely localizing short-lived glitches in longer clips.

Despite these limitations, the research highlights a potential future for AI as an assistant for human testers. While models struggled to pinpoint the exact timing of glitches in videos (the "needle-in-a-haystack" task), they proved surprisingly capable at generating descriptive bug reports, producing useful summaries for up to 50% of detected glitches. This benchmark is intended as a foundation for the ecosystem—both to guide model improvements and to inspire new hybrid workflows in which AI meaningfully augments expert human testers.

Creative AI Track:

Creative AI Panel: Art Content Creation: When demands are met by pipelines (or not)
Panelists: Zhuodi Cai, Ziyu Xu, Judah Goldfeder, Jiong Lim, Yonghyun Kim, Chris Donahue, and Yuki Mitsufuji (Sony AI/Sony Group Corporation); chaired by Yingtao Tian

Yuki Mitsufuji joined the panel “Art Content Creation: When demands are met by pipelines (or not)” to discuss how generative AI is reshaping creative workflows. He emphasized that building societal infrastructure that guarantees fair compensation for creators is an important foundation for fostering creativity together with AI. The panel explored lessons from building end-to-end workflows, highlighting both efficiencies and vulnerabilities introduced by automation, and discussed how to balance iterative AI processes with human judgment in reshaping creative agency and aesthetics of AI-assisted work.

Yuki Mitsufuji (Sony AI/Sony Group Corporation), presenting at the Creative AI Panel: Art Content Creation: When demands are met by pipelines (or not)

'Studies for': A Human–AI Co-Creative Sound Artwork Using a Real-time Multi-channel Sound Generation Model
Authors: Chihiro Nagashima (Sony Group Corporation), Akira Takahashi (Sony Group Corporation), Zhi Zhong (Sony Group Corporation), Shusuke Takahashi (Sony Group Corporation), Yuki Mitsufuji (Sony AI/Sony Group Corporation)

This study introduces the idea that generative models can help preserve artistic practice in cases where traditional documentation falls short. As the paper notes, many of Evala’s spatial sound works “might never be reproduced after his death.” The installation uses an AI system trained entirely on his past works to maintain his stylistic identity, allowing it to generate new material in real time. The authors frame this as “a new form of archive” that preserves style while producing new sound, and one that could extend an artist’s work “beyond their physical existence.”

This paper presents a sound installation created with the artist Evala and exhibited at ICC Tokyo from December 2024 to March 2025. The installation uses a real-time, eight-channel version of SpecMaskGIT to generate continuous audio in the exhibition space. The system was trained entirely on more than 200 hours of Evala’s past works, reflecting his sound material while producing new outputs in real time.

The work centers on the concept the authors describe as “a new form of archive”: a speculative approach to preserving an artist’s identity beyond conventional documentation. Site-specific sound installations are difficult to archive because they depend on spatial conditions that cannot be fully reproduced. By training a model exclusively on an artist-curated dataset, the installation creates a system that can continue generating sound aligned with the artist’s style while introducing new variations.

The paper outlines three conditions that the collaboration identified as necessary for co-creative work:

• the model must be lightweight enough to allow rapid feedback cycles,

• the dataset must reflect the artist’s own practice, and

• the system must be capable of producing unexpected outcomes.

To support these needs, the team optimized SpecMaskGIT for real-time use by reducing the number of Transformer blocks, switching to a faster vocoder (Vocos), and designing a dual conditioning scheme using Evala’s signature audio and the titles of his earlier works. This guided the model away from producing collage-like recombinations of prior pieces and toward generating coherent but novel sound textures.
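
One way to picture the dual conditioning is as a fusion of two embeddings before they reach the generator. The module below is a hypothetical sketch of such a fusion step, not the actual SpecMaskGIT conditioning code.

```python
import torch

class DualCondition(torch.nn.Module):
    """Illustrative fusion of two conditioning signals: an embedding of the
    artist's signature audio and an embedding of a past work's title."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.fuse = torch.nn.Linear(2 * dim, dim)

    def forward(self, audio_emb: torch.Tensor, title_emb: torch.Tensor):
        # Concatenate and project; the generator would condition on the result.
        return self.fuse(torch.cat([audio_emb, title_emb], dim=-1))
```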

Throughout the three-month exhibition, the model produced uninterrupted eight-channel audio without pre-rendering. The installation demonstrates how artist feedback can shape a viable workflow for AI-supported sound art.

Chihiro Nagashima (Sony Group Corporation), presenting ‘Studies for’: A Human-AI Co-Creative Sound Artwork Using a Real-time Multi-channel Sound Generation Model at Sony booth

Large-Scale Training Data Attribution for Music Generative Models via Unlearning
Authors: Woosung Choi (Sony AI), Junghyun Koo (Sony AI), Kin Wai Cheuk (Sony AI), Joan Serrà (Sony AI), Marco A. Martínez-Ramírez (Sony AI), Yukara Ikemiya (Sony AI), Naoki Murata (Sony AI), Yuhta Takida (Sony AI), Wei-Hsiang Liao (Sony AI), Yuki Mitsufuji (Sony AI/Sony Group Corporation)

This work examines training data attribution (TDA) for text-to-music diffusion models trained on large datasets. TDA is a way to identify which training samples most influenced a generated output. This is especially important in music. As the paper notes, “Generative AI has demonstrated impressive capabilities across modalities… While democratizing creative work, these advancements have also raised concerns regarding authorship, copyright, attribution, and ethics. Notably, generative models can unintentionally reproduce copyrighted material, posing risks of intellectual property violations.”

To help address this problem, the authors adapt an unlearning-based TDA method to a DiT text-to-music model trained on 115,000 tracks (4,356 hours of audio). Rather than retraining the model from scratch without each individual song, which would be computationally infeasible at this scale, the approach removes (“unlearns”) the generated output from the model and then measures how that change affects each training sample, which serves as a proxy for influence.
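
Conceptually, the procedure resembles the sketch below: copy the model, ascend the loss on the generated output for a few steps, then rank training tracks by how much their loss moved. The loss interface, data layout, and hyperparameters here are assumptions for illustration, not the paper’s implementation.

```python
import copy
import torch

def influence_by_unlearning(model, generated_batch, train_batches,
                            lr=1e-5, steps=10):
    """Illustrative unlearning-based attribution proxy.
    `model.loss(batch)` is an assumed interface returning the scalar
    diffusion training loss; `train_batches` is assumed to yield
    (track_id, batch) pairs."""
    unlearned = copy.deepcopy(model)
    opt = torch.optim.SGD(unlearned.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        (-unlearned.loss(generated_batch)).backward()  # gradient ascent
        opt.step()

    scores = {}
    with torch.no_grad():
        for track_id, batch in train_batches:
            # Influence proxy: how much unlearning the output changed
            # this training track's loss.
            scores[track_id] = (unlearned.loss(batch) - model.loss(batch)).item()
    return scores  # higher score = stronger attributed influence
```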

To check that the method behaves as expected, the authors run a self-influence experiment in which individual training tracks are unlearned. The goal is to see whether the method correctly identifies the removed track as one of the most influential. Finally, the method is compared with similarity-based approaches such as CLAP, CLEWS, LPIPS, and Representer Point Selection. Unlearning produces more focused attribution patterns and shows the strongest alignment with methods that rely on internal model representations.

The paper provides a practical path toward large-scale attribution for music generative models. It offers a way to examine how training data shapes a model’s output and supports clearer discussions around credit and accountability in generative audio.

Woosung Choi (Sony AI), presenting the Large-Scale Training Data Attribution for Music Generative Models via Unlearning poster

NeurIPS Community Engagement

GenProCC: 1st Workshop on Generative and Protective AI for Content Creation
Organizers: Wei-Yao Wang (Sony Group Corporation), Takashi Shibuya (Sony AI), Vali Lalioti, Qilu Wu (Sony Group Corporation), Wei Wang, Shusuke Takahashi (Sony Group Corporation), Yuki Mitsufuji (Sony AI/Sony Group Corporation)

The GenProCC workshop brought together researchers and creators to examine how generative AI supports new forms of content creation while introducing questions around copyright, safety, and responsible use. Discussions focused on emerging techniques across text, image, audio, and video generation, along with approaches for protecting creators and addressing IP concerns. The workshop provided a clear view of both the opportunities and challenges involved in building generative systems that operate responsibly in creative settings.

Wei-Yao Wang (Sony Group Corporation), leading the GenProCC: 1st Workshop on Generative and Protective AI for Content Creation

For more information, visit: https://neurips.cc/virtual/2025/workshop/109545
Workshop webpage: https://genprocc.github.io/

Invited Talk: Learning to Sense (L2S)
Speaker: Daisuke Iso (Sony AI)

The Learning to Sense invited talk examined a recent series of works on optimizing end-to-end imaging and sensing pipelines together with downstream machine-learning models, rather than treating them as separate stages. The discussion covered RAW-to-task pipelines, image processing and manipulation techniques, data augmentation methods, and energy-efficient sensing strategies for embedded and mobile systems. The session also highlighted evaluation methods, robustness considerations, and solutions to practical deployment problems. Overall, the talk offered a clear view of how task-driven sensing is reshaping approaches to data acquisition and real-world deployment.
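
The core idea, optimizing the sensing front end jointly with the downstream model, can be illustrated with a toy differentiable RAW-processing stage whose parameters receive gradients from the task loss. Everything below is an assumed minimal example, not any specific pipeline from the talk.

```python
import torch
import torch.nn.functional as F

class LearnableISP(torch.nn.Module):
    """Toy differentiable RAW-to-RGB stage: per-channel gains plus a gamma
    curve. Illustrative only; real ISPs have many more stages."""
    def __init__(self):
        super().__init__()
        self.gains = torch.nn.Parameter(torch.ones(3))
        self.gamma = torch.nn.Parameter(torch.tensor(0.45))

    def forward(self, raw):  # raw: (B, 3, H, W), already demosaicked
        x = raw * self.gains.view(1, 3, 1, 1)
        return x.clamp(min=1e-6) ** self.gamma

# Joint training: one optimizer over both stages, so gradients from the
# task loss flow back into the sensing parameters.
isp = LearnableISP()
task_model = torch.nn.Linear(3 * 16 * 16, 10)      # toy classifier
opt = torch.optim.Adam(list(isp.parameters()) + list(task_model.parameters()),
                       lr=1e-4)
raw = torch.rand(4, 3, 16, 16)                     # dummy RAW batch
labels = torch.randint(0, 10, (4,))
loss = F.cross_entropy(task_model(isp(raw).flatten(1)), labels)
loss.backward()
opt.step()
```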

Daisuke Iso (Sony AI), presenting Learning to Sense (L2S)

For more information, visit: https://sites.google.com/view/l2s-workshop/home

We invite readers to explore the full papers and presentations to learn more about the methods, evaluations, and perspectives that shaped this year’s contributions at: https://www.sony.com/en/SonyInfo/technology/Conference/NeurIPS2025/
