Abstract: Ambisonics is a scene-based spatial audio format that has several useful features compared to object-based formats, such as efficient whole scene rotation and versatility. However, it does not provide direct access to the individual source signals, so that these have to be separated from the mixture when required. Typically, this is done with linear spherical harmonics (SH) beamforming. In this paper, we explore deep-learning-based source separation on static Ambisonics mixtures. In contrast to most source separation approaches, which separate a fixed number of sources of specific sound types, we focus on separating arbitrary sound from specific directions. Specifically, we propose three operating modes that combine a source separation neural network with SH beamforming: refinement, implicit, and mixed mode. We show that a neural network can implicitly associate conditioning directions with the spatial information contained in the Ambisonics scene to extract specific sources. We evaluate the performance of the three proposed approaches and compare them to SH beamforming on musical mixtures generated with the musdb18 dataset, as well as with mixtures generated with the FUSS dataset for universal source separation, under both anechoic and room conditions. Results show that the proposed approaches offer improved separation performance and spatial selectivity compared to conventional SH beamforming.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper addresses the problem of separating sound sources from specific directions in Ambisonics, a scene-based spatial audio format. Ambisonics has several advantages over object-based formats, such as efficient whole-scene rotation and versatility. However, it does not provide direct access to the individual source signals, so these must be separated from the mixture when they are needed. Conventional linear spherical harmonics (SH) beamforming can do this, but its separation performance and spatial selectivity are limited.

The paper proposes a deep-learning-based method for separating sources from specific directions in static Ambisonics mixtures. Unlike most source separation methods, which separate a fixed number of sources of specific sound types, this paper focuses on separating arbitrary sounds from specific directions. Specifically, it proposes three operating modes that combine a source separation neural network with SH beamforming: refinement mode, implicit mode, and mixed mode. In these modes, the neural network can implicitly associate the conditioning directions with the spatial information in the Ambisonics scene and thereby extract specific sources.

The paper evaluates the three approaches and compares them with conventional SH beamforming on musical mixtures generated with the musdb18 dataset and on universal source separation mixtures generated with the FUSS dataset, under both anechoic and room conditions. The results show that the proposed approaches outperform conventional SH beamforming in both separation performance and spatial selectivity.
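To make the SH beamforming baseline mentioned above more concrete, here is a minimal sketch of a first-order Ambisonics encoder and a cardioid beamformer steered toward a chosen direction. It is only an illustration, not the paper's implementation: the Ambisonics order (first), the channel convention (ACN ordering with SN3D normalization), the cardioid beam pattern, and the function names `foa_gains`, `encode`, and `cardioid_beamform` are all assumptions made for this example. The neural network modes described in the paper are not reproduced here.

```python
import numpy as np


def foa_gains(azimuth, elevation):
    """First-order SH gains for a direction (radians).

    Assumed convention: ACN channel order (W, Y, Z, X), SN3D normalization.
    """
    return np.array([
        1.0,                                  # W (omnidirectional)
        np.sin(azimuth) * np.cos(elevation),  # Y
        np.sin(elevation),                    # Z
        np.cos(azimuth) * np.cos(elevation),  # X
    ])


def encode(signal, azimuth, elevation):
    """Encode a mono signal as a static anechoic point source, shape (4, n)."""
    return np.outer(foa_gains(azimuth, elevation), signal)


def cardioid_beamform(ambi, azimuth, elevation):
    """Linear beamformer: a weighted sum of the four Ambisonics channels.

    A source at angle theta from the steering direction is scaled by
    0.5 * (1 + cos(theta)), i.e. a cardioid pattern.
    """
    weights = 0.5 * foa_gains(azimuth, elevation)
    return weights @ ambi


# Two synthetic sources: one in front, one directly behind.
rng = np.random.default_rng(0)
s1, s2 = rng.standard_normal((2, 1000))
mixture = encode(s1, 0.0, 0.0) + encode(s2, np.pi, 0.0)

# Steering to the front keeps s1 and cancels s2 (the cardioid null is at 180 degrees).
front = cardioid_beamform(mixture, 0.0, 0.0)
```

With the interfering source exactly opposite the steering direction, this beamformer recovers the target. Move the interferer to 90 degrees and it is attenuated by only a factor of 0.5, which illustrates the limited spatial selectivity of low-order linear beamforming that motivates the learned approaches.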

Direction Specific Ambisonics Source Separation with End-To-End Deep Learning
