Concurrent Speaker Detection: A multi-microphone Transformer-Based Approach

Amit Eliav, Sharon Gannot
2024-03-12
Abstract: We present a deep-learning approach for the task of Concurrent Speaker Detection (CSD) using a modified transformer model. Our model is designed to handle multi-microphone data but can also work in the single-microphone case. The method classifies audio segments into one of three classes: 1) no speech activity (noise only), 2) only a single speaker is active, and 3) more than one speaker is active. We incorporate a Cost-Sensitive (CS) loss and confidence calibration into the training procedure. The approach is evaluated using three real-world databases: AMI, AliMeeting, and CHiME 5, demonstrating an improvement over existing approaches.
Audio and Speech Processing
What problem does this paper attempt to address?
This paper proposes a deep-learning approach to Concurrent Speaker Detection (CSD) based on a modified Transformer model. The CSD task is to determine whether speech is present in an audio signal and whether the speakers' activities overlap. Each audio segment is classified into one of three classes: no speech activity (noise only), single-speaker activity, and multi-speaker activity.

The main contributions of the paper are:
1. Extending the Transformer model to handle multi-microphone input while retaining support for single-microphone data.
2. Incorporating a Cost-Sensitive (CS) loss and confidence calibration into the training procedure to improve classification accuracy.
3. Evaluating the proposed approach on three real-world databases (AMI, AliMeeting, and CHiME 5), demonstrating an improvement over existing methods.

The paper reviews previous research on speaker counting, recognition, and overlapped-speech detection using CNN-, LSTM-, and Transformer-based methods. The authors emphasize the importance of exploiting multi-channel information to improve performance.

The proposed model is based on the Vision Transformer (ViT) architecture, adapted to audio: it takes a log-spectrogram as input and accepts both mono and multi-channel audio. The model consists of embedding, Transformer, and classification components, and is trained with a Cross-Entropy loss combined with Label Smoothing and the Cost-Sensitive loss. Experimental results on the three databases show that the model improves over existing approaches.
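
The three-class labeling and the combination of cross-entropy, label smoothing, and a cost-sensitive term can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the cost-sensitive term is a per-class weight on the loss, which is one common formulation. The function names, the class weights, and the smoothing value `eps` are all hypothetical.

```python
import math

# Class indices follow the three CSD classes described in the summary.
NOISE, SINGLE, OVERLAP = 0, 1, 2


def csd_label(num_active_speakers: int) -> int:
    """Map the number of active speakers in a segment to a CSD class."""
    if num_active_speakers < 0:
        raise ValueError("speaker count must be non-negative")
    # Two or more active speakers all fall into the overlap class.
    return min(num_active_speakers, OVERLAP)


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def smoothed_target(label, num_classes=3, eps=0.1):
    """Label smoothing: 1 - eps on the true class, eps shared by the others."""
    off = eps / (num_classes - 1)
    return [1.0 - eps if k == label else off for k in range(num_classes)]


def cs_smoothed_ce(logits, label, class_weights, eps=0.1):
    """Cross-entropy against a smoothed target, scaled by a per-class cost.

    Assumption: the cost-sensitive term is modeled as a weight indexed by
    the true class; the paper may define it differently.
    """
    log_p = log_softmax(logits)
    target = smoothed_target(label, len(logits), eps)
    ce = -sum(t * lp for t, lp in zip(target, log_p))
    return class_weights[label] * ce


# Hypothetical weights that penalize errors on overlap segments more heavily.
weights = [1.0, 1.0, 2.0]
loss = cs_smoothed_ce([0.2, 1.5, -0.3], csd_label(3), weights)
```

Weighting by the true class raises the penalty for errors on classes that are rare or costly to miss, while label smoothing keeps the model from becoming overconfident, which is related to the calibration goal mentioned above.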