Abstract: The use of audio and visual modalities for speaker localization has been well studied in the literature, exploiting their complementary characteristics. However, most previous works assume static sensors mounted at fixed positions. In contrast, this work explores the egocentric setting, where the heterogeneous sensors are embodied and may move with a human to facilitate speaker localization. Compared to the static scenario, the egocentric setting is more realistic for smart-home applications, e.g., a service robot. However, it also brings new challenges such as blurred images, frequent disappearance of the speaker from the wearer's field of view, and occlusions. In this paper, we study egocentric audio-visual speaker direction-of-arrival (DOA) estimation and address these challenges. Specifically, we propose a transformer-based audio-visual fusion method to estimate the DOA of the speaker relative to the wearer, and design a training strategy to mitigate the problem of the speaker disappearing from the camera's view. We also develop a new dataset for simulating out-of-view scenarios by creating a scene in which a camera wearer walks around while a speaker moves at the same time. The experimental results show that the proposed method offers promising tracking accuracy on this new dataset. Finally, we adapt the proposed method to the multi-speaker scenario. Experiments on EasyCom show the effectiveness of the proposed model for multiple speakers in real scenarios, where it achieves state-of-the-art results in the sphere active speaker detection task and the wearer activity prediction task. The simulated dataset and related code are available at https://github.com/KawhiZhao/Egocentric-Audio-Visual-Speaker-Localization.

The Audio-Visual Conversational Graph: From an Egocentric-Exocentric Perspective

Egocentric Auditory Attention Localization in Conversations

Listen to Look into the Future: Audio-Visual Egocentric Gaze Anticipation

Identification of Conversation Partners from Egocentric Video

Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion

Multi-Modal Gaze Following in Conversational Scenarios

Audio Visual Speaker Localization from EgoCentric Views

Revisiting audio visual scene-aware dialog

Heterogeneous Interactive Graph Network for Audio-Visual Question Answering

End-to-end audio visual scene-aware dialog using multimodal attention-based video features

Egocentric Audio-Visual Object Localization

How Does Audio Influence Visual Attention in Omnidirectional Videos? Database and Model

Retrieval-Augmented Egocentric Video Captioning

AVA-AVD: Audio-Visual Speaker Diarization in the Wild

Audio-Visual Segmentation

Learning Spatial Features from Audio-Visual Correspondence in Egocentric Videos

EgoCom: A Multi-Person Multi-Modal Egocentric Communications Dataset

Unlocking Exocentric Video-Language Data for Egocentric Video Representation Learning

Look, Listen, and Attend: Co-Attention Network for Self-Supervised Audio-Visual Representation Learning

SAV-SE: Scene-aware Audio-Visual Speech Enhancement with Selective State Space Model

Multi-Granularity Relational Attention Network for Audio-Visual Question Answering