Abstract: Data-based and learning-based sound source localization (SSL) has shown promising results in challenging conditions, and is commonly set as a classification or a regression problem. Regression-based approaches have certain advantages over classification-based, such as continuous direction-of-arrival estimation of static and moving sources. However, multi-source scenarios require multiple regressors without a clear training strategy up-to-date, that does not rely on auxiliary information such as simultaneous sound classification. We investigate end-to-end training of such methods with a technique recently proposed for video object detectors, adapted to the SSL setting. A differentiable network is constructed that can be plugged to the output of the localizer to solve the optimal assignment between predictions and references, optimizing directly the popular CLEAR-MOT tracking metrics. Results indicate large improvements over directly optimizing mean squared errors, in terms of localization error, detection metrics, and tracking capabilities.

What problem does this paper attempt to address?

This paper addresses the lack of a clear training strategy for deep-learning-based Sound Source Localization (SSL) in multi-source scenarios. Specifically, existing regression-based methods face the following challenges:

1. **Training problems in multi-source scenarios**: Current methods need multiple regressors, one for each source up to an assumed maximum number of sources. This creates a permutation problem between sources and regression outputs, which degrades training and, in turn, localization accuracy at inference.
2. **Lack of an effective activity detection mechanism**: At inference, an additional activity detection mechanism is required to handle the continuous stream of DOA (Direction of Arrival) estimates.
3. **Limitations of the optimization objective**: Existing methods usually optimize only the spatial localization error and do not include a source detection term, which limits overall performance.

To overcome these challenges, the paper proposes an end-to-end training strategy. It introduces a differentiable tracking module (Differentiable Tracking-Based Training) that directly optimizes the CLEAR-MOT tracking metrics, improving localization accuracy, detection performance, and tracking ability. Specific contributions include:

- **Combining the source detection term**: A source detection term is added to the loss function to improve overall performance.
- **Avoiding permutation errors**: Tracking-inspired loss terms are integrated so that predictions are not penalized for appearing in a different order than the references.
- **Adaptability under dynamically changing conditions**: The end-to-end training strategy handles a changing number of sound sources, which makes it suitable for annotated real-life recordings.

According to the reported experiments, the method clearly outperforms direct optimization of the mean squared error in terms of localization error, detection metrics, and multi-source tracking, and it is competitive under dynamic and reverberant conditions.
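The permutation problem and the detection terms described above can be illustrated with a small, non-differentiable sketch. It is not the paper's module, which is a differentiable network trained to solve the assignment. Here the assignment is found by brute force over permutations, identity switches are left out of the MOTA-style score, and matching is done before thresholding, which simplifies the CLEAR-MOT procedure. The function names, the azimuth-only setting, and the 20-degree threshold are assumptions made for illustration.

```python
import itertools
import math


def angular_error_deg(a, b):
    """Smallest absolute difference between two azimuths, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def best_assignment(preds, refs):
    """Brute-force one-to-one assignment with minimum total angular error.

    Returns (pairs, cost), where pairs holds (pred_index, ref_index) tuples.
    Only practical for a handful of sources (hypothetical helper).
    """
    swap = len(preds) < len(refs)
    small, large = (preds, refs) if swap else (refs, preds)
    best_pairs, best_cost = [], math.inf
    for perm in itertools.permutations(range(len(large)), len(small)):
        cost = sum(angular_error_deg(large[l], small[s]) for s, l in enumerate(perm))
        if cost < best_cost:
            best_cost = cost
            best_pairs = [(s, l) if swap else (l, s) for s, l in enumerate(perm)]
    return best_pairs, best_cost


def detection_scores(frames, threshold_deg=20.0):
    """Simplified MOTA/MOTP over frames; identity switches are NOT counted."""
    matches = misses = false_pos = n_refs = 0
    err_sum = 0.0
    for preds, refs in frames:
        pairs, _ = best_assignment(preds, refs)
        errs = [angular_error_deg(preds[p], refs[r]) for p, r in pairs]
        good = [e for e in errs if e <= threshold_deg]
        matches += len(good)
        err_sum += sum(good)
        misses += len(refs) - len(good)       # references left unmatched
        false_pos += len(preds) - len(good)   # predictions left unmatched
        n_refs += len(refs)
    mota = 1.0 - (misses + false_pos) / n_refs if n_refs else float("nan")
    motp = err_sum / matches if matches else float("nan")
    return mota, motp


# Two regressors emit the right directions, but in the "wrong" order.
preds, refs = [170.0, 32.0], [30.0, 175.0]
naive_error = sum(angular_error_deg(p, r) for p, r in zip(preds, refs))
pairs, matched_error = best_assignment(preds, refs)

frames = [
    ([170.0, 32.0], [30.0, 175.0]),  # both sources found
    ([100.0], [95.0, 250.0]),        # one source missed
    ([10.0, 200.0], [12.0]),         # one false positive
]
mota, motp = detection_scores(frames)
print(naive_error, matched_error, pairs, mota, motp)
```

With a fixed output order, the first frame is charged a large error even though both directions are nearly correct; after assignment the error is small. For more than a few sources, an exact solver such as SciPy's `scipy.optimize.linear_sum_assignment` would be a more practical stand-in than enumerating permutations.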

Differentiable Tracking-Based Training of Deep Learning Sound Source Localizers

Position tracking of a varying number of sound sources with sliding permutation invariant training

Sound source localization based on multi-task learning and image translation network

Analytic Class Incremental Learning for Sound Source Localization with Privacy Protection

Deep Learning-Enabled High-Resolution and Fast Sound Source Localization in Spherical Microphone Array System

BeamLearning: An end-to-end deep learning approach for the angular localization of sound sources using raw multichannel acoustic pressure data

A Time-domain End-to-End Method for Sound Source Localization Using Multi-Task Learning

The LOCATA Challenge: Acoustic Source Localization and Tracking

SSLIDE: Sound Source Localization for Indoors Based on Deep Learning

A Cascaded Multiple-Speaker Localization and Tracking System

Multilevel B-Splines-Based Learning Approach for Sound Source Localization

New Direct Approaches to Robust Sound Source Localization

Adaptive high-precision sound source localization at low frequencies based on convolutional neural network

Dual input neural networks for positional sound source localization

Eliminating Quantization Errors in Classification-Based Sound Source Localization

Localization, Detection and Tracking of Multiple Moving Sound Sources with a Convolutional Recurrent Neural Network

Sound source localization based on residual network and channel attention module

SSLNet: A Network for Cross-Modal Sound Source Localization in Visual Scenes

Multitask learning of time-frequency CNN for sound source localization

A Generalized Network Based on Multi-Scale Densely Connection and Residual Attention for Sound Source Localization and Detection

Unsupervised Sound Localization via Iterative Contrastive Learning