Direction Specific Ambisonics Source Separation with End-To-End Deep Learning

Francesc Lluís, Nils Meyer-Kahlen, Vasileios Chatziioannou, Alex Hofmann
2023-06-20
Abstract: Ambisonics is a scene-based spatial audio format that has several useful features compared to object-based formats, such as efficient whole scene rotation and versatility. However, it does not provide direct access to the individual source signals, so that these have to be separated from the mixture when required. Typically, this is done with linear spherical harmonics (SH) beamforming. In this paper, we explore deep-learning-based source separation on static Ambisonics mixtures. In contrast to most source separation approaches, which separate a fixed number of sources of specific sound types, we focus on separating arbitrary sound from specific directions. Specifically, we propose three operating modes that combine a source separation neural network with SH beamforming: refinement, implicit, and mixed mode. We show that a neural network can implicitly associate conditioning directions with the spatial information contained in the Ambisonics scene to extract specific sources. We evaluate the performance of the three proposed approaches and compare them to SH beamforming on musical mixtures generated with the musdb18 dataset, as well as with mixtures generated with the FUSS dataset for universal source separation, under both anechoic and room conditions. Results show that the proposed approaches offer improved separation performance and spatial selectivity compared to conventional SH beamforming.
Sound, Audio and Speech Processing
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper addresses the separation of sound sources arriving from specific directions in Ambisonics, a scene-based spatial audio format. Ambisonics has several advantages over object-based formats, such as efficient whole-scene rotation and versatility. However, it does not provide direct access to the individual source signals, so these must be separated from the mixture when required. Conventional linear spherical harmonics (SH) beamforming can do this, but its performance is limited.

The paper proposes a deep-learning-based method for separating sources from specific directions in static Ambisonics mixtures. Unlike most source separation methods, which separate a fixed number of sources of specific sound types, this paper focuses on separating arbitrary sounds from specific directions. It proposes three operating modes that combine a source separation neural network with SH beamforming: refinement mode, implicit mode, and mixed mode. In these modes, the neural network can implicitly associate the conditioning directions with the spatial information in the Ambisonics scene and thereby extract specific sources.

The paper evaluates the three approaches and compares them with conventional SH beamforming on musical mixtures generated with the musdb18 dataset and on universal source separation mixtures generated with the FUSS dataset, under both anechoic and room conditions. The results show that the proposed approaches improve on conventional SH beamforming in both separation performance and spatial selectivity.
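
To make the SH beamforming baseline concrete, the following is a minimal sketch of a first-order beamformer applied to a synthetic anechoic mixture. It is an illustration under stated assumptions, not the paper's implementation: the first-order setting, the ACN channel order with N3D normalization, the plain steering-vector weights, and the helper names `sh_first_order`, `encode`, and `beamform` are all chosen here for the example.

```python
import numpy as np

def sh_first_order(azimuth, elevation):
    """Real first-order SH vector for a direction given in radians.

    Assumed convention: ACN channel order (W, Y, Z, X), N3D normalization.
    """
    x = np.cos(azimuth) * np.cos(elevation)
    y = np.sin(azimuth) * np.cos(elevation)
    z = np.sin(elevation)
    return np.array([1.0, np.sqrt(3) * y, np.sqrt(3) * z, np.sqrt(3) * x])

def encode(signal, azimuth, elevation):
    """Encode a mono signal as an anechoic first-order Ambisonics source (4 x T)."""
    return np.outer(sh_first_order(azimuth, elevation), signal)

def beamform(ambi, azimuth, elevation):
    """Steer a linear beam toward a look direction.

    The weights are the SH vector of the look direction divided by the number
    of channels (4), which gives unit gain in the look direction.
    """
    weights = sh_first_order(azimuth, elevation) / 4.0
    return weights @ ambi

# Two hypothetical noise sources, 90 degrees apart in the horizontal plane.
rng = np.random.default_rng(0)
s1 = rng.standard_normal(1000)
s2 = rng.standard_normal(1000)
mixture = encode(s1, 0.0, 0.0) + encode(s2, np.pi / 2, 0.0)

# Extract the source at azimuth 0. The beam pattern is (1 + 3*cos(theta)) / 4,
# so the interferer at theta = 90 degrees leaks through with gain 0.25.
estimate = beamform(mixture, 0.0, 0.0)
leakage = estimate - s1
```

With this beam pattern, the estimate contains the target at unit gain plus the interfering source attenuated only to a quarter of its amplitude. This residual leakage at low Ambisonics orders is the kind of limited spatial selectivity that the three proposed modes aim to improve on.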