CASA-Net: Cross-attention and Self-attention for End-to-End Audio-visual Speaker Diarization

Haodong Zhou, Tao Li, Jie Wang, Lin Li, Qingyang Hong
DOI: https://doi.org/10.1109/apsipaasc58517.2023.10317320
2023-01-01
Abstract: Audio-visual speaker diarization (AVSD) is a critical technique that segments audio-visual signals and assigns the segments to multiple speakers in practical scenarios. Efficiently extracting cross-modal information from audio-visual signals is therefore essential for AVSD. In this paper, we present CASA-Net, an embedding fusion method for end-to-end AVSD systems. CASA-Net incorporates a cross-attention (CA) module to capture cross-modal information in audio-visual signals, and uses a self-attention (SA) module to learn the context information among audio-visual frames. On the development set of the Multimodal Information Based Speech Processing (MISP) Challenge 2022, the proposed CASA-Net achieved a diarization error rate (DER) of 13.18%, which is 1.56% lower than that of the concatenation (Concat) method. To further enhance performance, we used beamforming to integrate the available multi-channel audio information, along with data augmentation. By fusing multiple systems, we ultimately achieved a DER of 12.60% on the development set.
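The fusion idea described in the abstract (cross-attention between modalities, followed by self-attention across frames) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the layer layout, so the sketch assumes single-head scaled dot-product attention with no learned projections, equal frame counts and embedding sizes for both modalities, and per-frame concatenation of the two cross-attended streams. The names `attention` and `casa_fusion_sketch` are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Single-head scaled dot-product attention on lists of vectors.

    No learned projections are applied (an assumption made for brevity).
    """
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return out


def casa_fusion_sketch(audio, visual):
    """Hypothetical CA-then-SA fusion of frame-aligned embeddings."""
    # Cross-attention: audio frames query visual frames, and vice versa.
    audio_to_visual = attention(audio, visual, visual)
    visual_to_audio = attention(visual, audio, audio)
    # Assumption: merge the two cross-attended streams frame by frame.
    fused = [a + v for a, v in zip(audio_to_visual, visual_to_audio)]
    # Self-attention: every fused frame attends to all frames for context.
    return attention(fused, fused, fused)


# Toy input: 3 frames, 2-dimensional embeddings per modality.
audio = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
visual = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]]
fused = casa_fusion_sketch(audio, visual)
```

With these toy inputs, `fused` holds one vector per frame whose length is twice the input embedding size. A real system would add learned query, key, and value projections, multiple heads, and a speaker-activity output layer, none of which are detailed in the abstract.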