Active Speaker Detection Using Audio, Visual, and Depth Modalities: A Survey
Siti Nur Aisyah Mohd Robi, Muhammad Atiff Zakwan Mohd Ariffin, Mohd Azri Mohd Izhar, Norulhusna Ahmad, Hazilah Mad Kaidi
DOI: https://doi.org/10.1109/access.2024.3426670
IF: 3.9
2024-07-19
IEEE Access
Abstract: The rapid progress of multimodal signal processing in recent years has cleared the way for novel applications in human-computer interaction, surveillance, and telecommunication. Active Speaker Detection (ASD) is a critical pre-processing step with numerous applications such as voice recognition, speaker diarization, and noise reduction. This paper comprehensively reviews ASD, covering methods and datasets based on three modalities: audio, visual, and depth, used alone or in combination. ASD methods fall into two broad categories: single-modality ASD and multi-modality ASD. The review examines the most common ASD modalities, which include audio-based ASD (A-ASD), visual-based ASD (V-ASD), audio-visual ASD (AV-ASD), and audio-visual-depth ASD (AVD-ASD). Each approach is described in detail, covering both model-based and neural network-based methods. Finally, challenges and future research opportunities are highlighted to encourage broader use of ASD.
Computer Science, Information Systems; Telecommunications; Engineering, Electrical & Electronic