Abstract: The use of audio and visual modalities for speaker localization has been well studied in the literature, exploiting their complementary characteristics. However, most previous works assume static sensors mounted at fixed positions. In contrast, this work explores the egocentric setting, where the heterogeneous sensors are embodied and may move with a human to facilitate speaker localization. Compared to the static scenario, the egocentric setting is more realistic for smart-home applications such as a service robot. However, it also brings new challenges such as blurred images, frequent disappearance of the speaker from the wearer's field of view, and occlusions. In this paper, we study egocentric audio-visual speaker direction-of-arrival (DOA) estimation and address these challenges. Specifically, we propose a transformer-based audio-visual fusion method to estimate the DOA of the speaker relative to the wearer, and design a training strategy to mitigate the problem of the speaker disappearing from the camera's view. We also develop a new dataset for simulating out-of-view scenarios, by creating a scene in which a camera wearer walks around while a speaker moves at the same time. The experimental results show that our proposed method offers promising performance on this new dataset in terms of tracking accuracy. Finally, we adapt the proposed method to the multi-speaker scenario. Experiments on EasyCom show the effectiveness of the proposed model for multiple speakers in real scenarios, achieving state-of-the-art results in the sphere active speaker detection task and the wearer activity prediction task. The simulated dataset and related code are available at <a class="link-external link-https" href="https://github.com/KawhiZhao/Egocentric-Audio-Visual-Speaker-Localization" rel="external noopener nofollow">this https URL</a>.

Incorporating Lip Features into Audio-Visual Multi-Speaker DOA Estimation by Gated Fusion

Audio-Visual Speech Enhancement with Deep Multi-modality Fusion

Deep Learning Based Audio-Visual Multi-Speaker DOA Estimation Using Permutation-Free Loss Function

AudioVSR: Enhancing Video Speech Recognition with Audio Data

Cross-modal Mask Fusion and Modality-Balanced Audio-Visual Speech Recognition

Lip Graph Assisted Audio-Visual Speech Recognition Using Bidirectional Synchronous Fusion

A Facial Feature and Lip Movement Enhanced Audio-Visual Speech Separation Model

Integration of audio-visual information for multi-speaker multimedia speaker recognition

Audio-visual End-to-end Multi-channel Speech Separation, Dereverberation and Recognition

Audio Visual Speaker Localization from EgoCentric Views

Data Fusion for Audiovisual Speaker Localization: Extending Dynamic Stream Weights to the Spatial Domain

Integrating Audio, Visual, and Semantic Information for Enhanced Multimodal Speaker Diarization

A Novel Lip Descriptor for Audio-Visual Keyword Spotting Based on Adaptive Decision Fusion

DyViSE: Dynamic Vision-Guided Speaker Embedding for Audio-Visual Speaker Diarization

Audio-Visual System for Robust Speaker Recognition

Delay-and-Sum Beamforming Based Spatial Mapping for Multi-Source Sound Localization

Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder

DOA estimation of multiple speech sources based on the single-source point detection using an FOA microphone

A Spatial Long-Term Iterative Mask Estimation Approach for Multi-Channel Speaker Diarization and Speech Recognition

Correlating subword articulation with lip shapes for embedding aware audio-visual speech enhancement

Audio-visual Recognition of Overlapped speech for the LRS2 dataset