Abstract: The complementary characteristics of the audio and visual modalities have been well studied for speaker localization. However, most previous works assume static sensors mounted at fixed positions. In contrast, this work explores the egocentric setting, where the heterogeneous sensors are embodied and may move with a human to facilitate speaker localization. Compared to the static scenario, the egocentric setting is more realistic for smart-home applications, e.g., a service robot. However, it also brings new challenges such as blurred images, frequent speaker disappearance from the wearer's field of view, and occlusions. In this paper, we study egocentric audio-visual speaker direction-of-arrival (DOA) estimation and address these challenges. Specifically, we propose a transformer-based audio-visual fusion method to estimate the DOA of the speaker relative to the wearer, and design a training strategy to mitigate the problem of the speaker disappearing from the camera's view. We also develop a new dataset for simulating out-of-view scenarios by creating a scene in which a camera wearer walks around while a speaker moves at the same time. Experimental results show that the proposed method offers promising tracking accuracy on this new dataset. Finally, we adapt the proposed method to the multi-speaker scenario. Experiments on EasyCom show the effectiveness of the proposed model for multiple speakers in real scenarios, where it achieves state-of-the-art results on the sphere active speaker detection task and the wearer activity prediction task. The simulated dataset and related code are available on GitHub (https://github.com/KawhiZhao/Egocentric-Audio-Visual-Speaker-Localization).
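To make the notion of a wearer-relative DOA concrete, the following is a minimal geometric sketch, not the paper's method or code. It assumes a 2-D world frame, a known wearer heading, and a hypothetical horizontal camera field of view of 90 degrees; the function names and parameters are illustrative only. It shows how a speaker's azimuth relative to the wearer changes as the wearer turns, and when the speaker would fall outside the camera's view.

```python
import math


def relative_azimuth_deg(wearer_xy, wearer_heading_deg, speaker_xy):
    """Azimuth of the speaker relative to the wearer's facing direction.

    Angles are measured counter-clockwise in a 2-D world frame (an assumption
    of this sketch). The result is wrapped to the interval [-180, 180).
    """
    dx = speaker_xy[0] - wearer_xy[0]
    dy = speaker_xy[1] - wearer_xy[1]
    # Bearing of the speaker in the world frame.
    world_bearing_deg = math.degrees(math.atan2(dy, dx))
    # Subtract the wearer's heading, then wrap to [-180, 180).
    rel = world_bearing_deg - wearer_heading_deg
    return (rel + 180.0) % 360.0 - 180.0


def in_field_of_view(rel_azimuth_deg, horizontal_fov_deg=90.0):
    """True if the relative azimuth lies inside a hypothetical horizontal FOV."""
    return abs(rel_azimuth_deg) <= horizontal_fov_deg / 2.0


if __name__ == "__main__":
    speaker = (1.0, 1.0)
    for heading in (0.0, 90.0, 180.0):
        az = relative_azimuth_deg((0.0, 0.0), heading, speaker)
        print(f"heading={heading:6.1f}  relative azimuth={az:7.1f}  "
              f"in view={in_field_of_view(az)}")
```

In this toy setup the speaker stays fixed while the wearer rotates, so the relative azimuth shifts with the heading and the speaker leaves the assumed field of view once the wearer turns far enough away. This is the out-of-view situation in which audio remains the only available cue.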

Deep Learning Based Audio-Visual Multi-Speaker DOA Estimation Using Permutation-Free Loss Function

Incorporating Lip Features into Audio-Visual Multi-Speaker DOA Estimation by Gated Fusion

Spatial Sectorized Neural Network for 2-D DOA Estimation in the Full Azimuth

Audio Visual Speaker Localization from EgoCentric Views

Delay-and-Sum Beamforming Based Spatial Mapping for Multi-Source Sound Localization

Neural Ambisonic Encoding For Multi-Speaker Scenarios Using A Circular Microphone Array

Multi-target direction-of-arrival estimation of deep models with frame-level permutation invariant training in marine acoustic environment

Deep learning-based direction-of-arrival estimation for multiple speech sources using a small scale array

Data Fusion for Audiovisual Speaker Localization: Extending Dynamic Stream Weights to the Spatial Domain

Egocentric Deep Multi-Channel Audio-Visual Active Speaker Localization

A Spatial Long-Term Iterative Mask Estimation Approach for Multi-Channel Speaker Diarization and Speech Recognition

A Deep Learning Method for DOA Estimation with Covariance Matrices in Reverberant Environments

A Multi-channel Speech Separation System for Unknown Number of Multiple Speakers

Time-Frequency Bins Selection for Direction of Arrival Estimation Based on Speech Presence Probability Learning

Audio-visual End-to-end Multi-channel Speech Separation, Dereverberation and Recognition

Deep Learning Based Stage-wise Two-dimensional Speaker Localization with Large Ad-hoc Microphone Arrays

DOA estimation of multiple speech sources based on the single-source point detection using an FOA microphone

RAVSS: Robust Audio-Visual Speech Separation in Multi-Speaker Scenarios with Missing Visual Cues

Joint DOA Estimation and Dereverberation Based on Multi-Channel Linear Prediction Filtering and Azimuth Sparsity

Uncertainty-Guided End-to-End Audio-Visual Speaker Diarization for Far-Field Recordings

AcousticFusion: Fusing Sound Source Localization to Visual SLAM in Dynamic Environments