Differentiable Tracking-Based Training of Deep Learning Sound Source Localizers

Sharath Adavanne, Archontis Politis, Tuomas Virtanen
DOI: https://doi.org/10.48550/arXiv.2111.00030
2021-10-30
Abstract: Data-based and learning-based sound source localization (SSL) has shown promising results in challenging conditions, and is commonly set as a classification or a regression problem. Regression-based approaches have certain advantages over classification-based, such as continuous direction-of-arrival estimation of static and moving sources. However, multi-source scenarios require multiple regressors without a clear training strategy up-to-date, that does not rely on auxiliary information such as simultaneous sound classification. We investigate end-to-end training of such methods with a technique recently proposed for video object detectors, adapted to the SSL setting. A differentiable network is constructed that can be plugged to the output of the localizer to solve the optimal assignment between predictions and references, optimizing directly the popular CLEAR-MOT tracking metrics. Results indicate large improvements over directly optimizing mean squared errors, in terms of localization error, detection metrics, and tracking capabilities.
Audio and Speech Processing, Sound
What problem does this paper attempt to address?
The paper addresses the lack of a clear training strategy for deep-learning-based sound source localization (SSL) in multi-source scenarios. Specifically, existing regression-based methods face the following challenges:

1. **Training in multi-source scenarios**: Current methods need one regressor per source, up to an assumed maximum number of sources. This creates a permutation problem between sources and regression outputs, which degrades both training and localization accuracy at inference.
2. **Lack of an effective activity detection mechanism**: At inference, an additional activity detection mechanism is required to handle the continuous stream of DOA (direction-of-arrival) estimates.
3. **Limitations of the optimization objective**: Existing methods usually optimize only the spatial localization error, without a source detection term, which limits overall performance.

To overcome these challenges, the paper proposes an end-to-end training strategy. A differentiable tracking module, plugged into the output of the localizer, solves the assignment between predictions and references so that training directly optimizes the CLEAR-MOT tracking metrics, improving localization accuracy, detection performance, and tracking ability. Specific contributions include:

- **Combining the source detection term**: Adding a source detection term to the loss function to improve overall performance.
- **Avoiding permutation errors**: Avoiding permutation errors by integrating tracking-inspired loss terms.
- **Adaptability under dynamically changing conditions**: Providing an end-to-end training strategy that handles a changing number of active sources, which makes it suitable for annotated real-life recordings.
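The permutation problem described above can be illustrated with a small Python sketch. This is not the paper's differentiable assignment network: it is a plain brute-force search over permutations, shown only to make the idea of an optimal prediction-to-reference assignment concrete. The function names, the (azimuth, elevation) representation in degrees, and the example values are all assumptions made for illustration.

```python
import math
from itertools import permutations


def angular_distance_deg(a, b):
    """Great-circle angle in degrees between two (azimuth, elevation) pairs given in degrees."""
    az1, el1 = map(math.radians, a)
    az2, el2 = map(math.radians, b)
    cos_angle = (math.sin(el1) * math.sin(el2)
                 + math.cos(el1) * math.cos(el2) * math.cos(az1 - az2))
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))


def best_assignment(predictions, references):
    """Brute-force optimal one-to-one assignment (hypothetical helper).

    Assumes len(predictions) == len(references). Returns (perm, mean_error),
    where perm[r] is the index of the prediction matched to reference r.
    """
    n = len(references)
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        cost = sum(angular_distance_deg(predictions[p], references[r])
                   for r, p in enumerate(perm))
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return best_perm, best_cost / n


# Two sources; the regressor outputs are accurate but in swapped order.
refs = [(30.0, 0.0), (-90.0, 10.0)]
preds = [(-88.0, 10.0), (32.0, 0.0)]

# Comparing output i with reference i penalizes a good prediction heavily.
fixed_order_error = sum(angular_distance_deg(p, r)
                        for p, r in zip(preds, refs)) / len(refs)

# Solving the assignment first recovers the small true error.
perm, matched_error = best_assignment(preds, refs)
print(perm, round(matched_error, 2), round(fixed_order_error, 2))
```

With a fixed output order the mean error is well above 90 degrees even though both estimates are within about 2 degrees of a true source. For more than a handful of sources, a Hungarian-algorithm solver such as SciPy's `scipy.optimize.linear_sum_assignment` would normally replace the brute-force loop. Neither variant is differentiable, which is the gap the paper's assignment network is meant to close.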
The method improves localization accuracy and detection performance, and it also performs well in multi-source tracking. Experimental results show large improvements over directly optimizing the mean squared error in terms of localization error, detection metrics, and tracking capabilities, and the method is competitive under dynamic and reverberant conditions.
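For readers unfamiliar with the CLEAR-MOT metrics mentioned above, the sketch below computes the two standard quantities, MOTA and MOTP, from per-frame counts. It follows the usual CLEAR-MOT definitions, not the differentiable approximations used for training in the paper, whose exact form is not given in this summary. The function names, the tuple layout, and the example counts are hypothetical.

```python
def mota(frames):
    """Multiple object tracking accuracy from per-frame counts.

    Each frame is a tuple (false_negatives, false_positives,
    id_switches, num_references). Standard CLEAR-MOT definition:
    MOTA = 1 - (FN + FP + IDSW) / total number of reference sources.
    """
    errors = sum(fn + fp + idsw for fn, fp, idsw, _ in frames)
    total_refs = sum(num_refs for _, _, _, num_refs in frames)
    if total_refs == 0:
        raise ValueError("MOTA is undefined without reference sources")
    return 1.0 - errors / total_refs


def motp(matched_distances):
    """Multiple object tracking precision: mean distance over all matches.

    For SSL the distance would typically be an angular error in degrees
    (an assumption here); lower values mean more precise localization.
    """
    if not matched_distances:
        raise ValueError("MOTP is undefined without matched pairs")
    return sum(matched_distances) / len(matched_distances)


# Hypothetical counts over four frames with five active sources each.
example_frames = [
    (1, 0, 0, 5),
    (0, 1, 0, 5),
    (1, 0, 1, 5),
    (0, 0, 0, 5),
]
example_mota = mota(example_frames)
example_motp = motp([2.0, 4.0, 3.0, 3.0])
print(example_mota, example_motp)
```

MOTA combines missed sources, false detections, and identity switches into one score, which is why optimizing it covers the detection term that a pure mean-squared-error objective leaves out.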