Real-time Architecture for Audio-Visual Active Speaker Detection.
Min Huang, Wen Wang, Zheyuan Lin, Fiseha B. Tesema, Shanshan Ji, Jason Gu, Minhong Wang, Wei Song, Te Li, Shiqiang Zhu
DOI: https://doi.org/10.1109/robio55434.2022.10011692
2022-01-01
Abstract: Continuously measuring the speaking state of users interacting with a robot in a human-robot interaction (HRI) system improves metrics of interaction quality. However, mainstream active speaker detection (ASD) algorithms emphasize achieving high frame-level AUC on the AVA-Active Speaker dataset and pay less attention to real-time performance in robotic systems. In this paper, we propose a model named FSDNet that maintains a high AUC score on the AVA-Active Speaker dataset while reducing time cost: compared with the state of the art, our model increases the AUC score by 0.1% and needs only 75% of the running time. Furthermore, we put forward an architecture with a time-related prediction function to make our algorithm more effective and generalizable in interactive robotic systems. The code is released at https://github.com/huangmin9966/FSDNet_RealTimeArch.