Abstract: This study considers the problem of detecting and locating an active talker's horizontal position from multichannel audio captured by a microphone array. We refer to this as active speaker detection and localization (ASDL). Our goal was to investigate the performance of spatial acoustic features extracted from the multichannel audio as the input of a convolutional recurrent neural network (CRNN), in relation to the number of channels employed and additive noise. To this end, experiments were conducted to compare the generalized cross-correlation with phase transform (GCC-PHAT), the spatial cue-augmented log-spectrogram (SALSA) features, and a recently proposed beamforming method, evaluating their robustness to various noise intensities. The array aperture and sampling density were tested by taking subsets from the 16-microphone array. Results and tests of statistical significance demonstrate the microphones' contribution to performance on the TragicTalkers dataset, which offers opportunities to investigate audio-visual approaches in the future.
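The two experimental factors named in the abstract, additive noise and microphone subsets, can be illustrated with a minimal NumPy sketch. It is not the paper's procedure: the white Gaussian noise model, the SNR parameterization in dB, the function names `add_noise_at_snr` and `select_channels`, and the example channel indices are all assumptions made for illustration.

```python
import numpy as np

def add_noise_at_snr(audio, snr_db, rng=None):
    """Add white Gaussian noise to a (channels, samples) array.

    Assumption: noise is white, Gaussian, and scaled so that the
    signal-to-noise ratio over the whole array equals snr_db.
    """
    rng = np.random.default_rng() if rng is None else rng
    signal_power = np.mean(audio ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.standard_normal(audio.shape) * np.sqrt(noise_power)
    return audio + noise

def select_channels(audio, channel_indices):
    """Keep only the listed microphone channels (hypothetical subset choice)."""
    return audio[np.asarray(channel_indices), :]

# Example: a synthetic 16-channel recording, 1 s at 16 kHz (assumed rate).
rng = np.random.default_rng(0)
audio = rng.standard_normal((16, 16000))
noisy = add_noise_at_snr(audio, snr_db=10.0, rng=rng)
subset = select_channels(noisy, [0, 5, 10, 15])  # 4 of the 16 microphones
```

Widely spaced indices keep the aperture while lowering sampling density, and adjacent indices do the opposite. Which subsets the paper actually tested is not stated in this summary.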

What problem does this paper attempt to address?

This paper addresses the detection and horizontal localization of active speakers from multichannel audio captured by a microphone array. Specifically, it investigates how spatial acoustic features extracted from the multichannel audio perform as inputs to a convolutional recurrent neural network (CRNN), in relation to the number of channels used and the level of additive noise. To this end, the experiments compare the generalized cross-correlation with phase transform (GCC-PHAT), the spatial cue-augmented log-spectrogram (SALSA) features, and a recently proposed beamforming method, evaluating their robustness at different noise levels. The main contributions of the paper include:

1. **Feature Extraction**: Investigated various spatial acoustic features, including GCC-PHAT, SALSA and its variants, and beamforming methods.
2. **Performance Evaluation**: Experimentally evaluated these features with different numbers of microphones and under different noise conditions.
3. **Dataset**: Used the TragicTalkers dataset, which includes multi-view video and multichannel audio recordings, providing a rich testing environment.
4. **Network Architecture**: Employed a CRNN as the common backend network so that the features are compared fairly.

Overall, the paper explores and compares the effectiveness and robustness of different audio features for active speaker detection and localization from multichannel audio.
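For readers unfamiliar with GCC-PHAT, the following is a minimal NumPy sketch of the textbook formulation for a single microphone pair. It is not the paper's feature pipeline: the function name `gcc_phat`, the zero-padding length, the small regularization constant, and the 16 kHz sampling rate in the example are assumptions. The idea is to whiten the cross-spectrum so that only phase remains, then read the inter-microphone delay from the peak of the resulting correlation.

```python
import numpy as np

def gcc_phat(x, y, fs, max_tau=None):
    """Estimate the delay of x relative to y (in seconds) with GCC-PHAT.

    Returns the estimated delay and the correlation over the searched lags.
    """
    n = len(x) + len(y)                      # zero-pad to avoid circular wrap-around
    X = np.fft.rfft(x, n=n)
    Y = np.fft.rfft(y, n=n)
    cross = X * np.conj(Y)                   # cross-power spectrum
    cross /= np.abs(cross) + 1e-12           # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n=n)            # generalized cross-correlation

    max_shift = n // 2
    if max_tau is not None:                  # optionally restrict to plausible lags
        max_shift = max(1, min(int(fs * max_tau), max_shift))
    # Reorder so that index 0 corresponds to lag -max_shift.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    lag = int(np.argmax(cc)) - max_shift
    return lag / fs, cc

# Example: y is a noise burst, x is the same burst delayed by 5 samples.
rng = np.random.default_rng(0)
y = rng.standard_normal(1024)
x = np.concatenate((np.zeros(5), y[:-5]))
tau, cc = gcc_phat(x, y, fs=16000)
```

In a learning-based system such as the one summarized above, correlation vectors like `cc` would typically be computed per time frame and per microphone pair and stacked as network input. The exact framing and pairing used in the paper are not given in this summary.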

Audio Inputs for Active Speaker Detection and Localization via Microphone Array