Audio-Visual Talker Localization in Video for Spatial Sound Reproduction

Davide Berghi, Philip J. B. Jackson

2024-06-02

Abstract: Object-based audio production requires the positional metadata to be defined for each point-source object, including the key elements in the foreground of the sound scene. In many media production use cases, both cameras and microphones are employed to make recordings, and the human voice is often a key element. In this research, we detect and locate the active speaker in the video, facilitating the automatic extraction of the positional metadata of the talker relative to the camera's reference frame. With the integration of the visual modality, this study expands upon our previous investigation focused solely on audio-based active speaker detection and localization. Our experiments compare conventional audio-visual approaches for active speaker detection that leverage monaural audio, our previous audio-only method that leverages multichannel recordings from a microphone array, and a novel audio-visual approach integrating vision and multichannel audio. We found that the two modalities complement each other. Multichannel audio, overcoming the problem of visual occlusions, provides a double-digit reduction in detection error compared to audio-visual methods with single-channel audio. The combination of multichannel audio and vision further enhances spatial accuracy, leading to a four-percentage-point increase in F1 score on the Tragic Talkers dataset. Future investigations will assess the robustness of the model in noisy and highly reverberant environments, as well as tackle the problem of off-screen speakers.

Audio and Speech Processing, Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses the automatic extraction of positional metadata for audio objects in object-based audio production. Specifically, it detects and locates the active speaker in video so that the talker's position relative to the camera's reference frame can be extracted automatically. It extends the authors' earlier audio-only method by combining the visual modality with multichannel audio from a microphone array. The main issues addressed are:

1. **Speaker Detection and Localization**: Automatically detecting and locating the active speaker in video to generate the talker's positional metadata.
2. **Multimodal Fusion**: Combining video with multichannel audio to overcome the limitations of approaches that rely on monaural audio with vision, or on audio alone.
3. **Robustness Enhancement**: Maintaining detection and localization accuracy in difficult conditions such as visual occlusions.

In the comparative experiments, multichannel audio provides a double-digit reduction in detection error relative to audio-visual methods that use single-channel audio, and adding vision to multichannel audio further improves spatial accuracy, giving a four-percentage-point increase in F1 score on the Tragic Talkers dataset.
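As an illustration of what "positional metadata relative to the camera's reference frame" could look like in practice, the sketch below converts a detected speaker's pixel position into azimuth and elevation angles. This is not the authors' method: it assumes an ideal pinhole camera with the principal point at the image centre, square pixels, no lens distortion, and a known horizontal field of view. The function name `pixel_to_angles` and all parameters are hypothetical.

```python
import math


def pixel_to_angles(x, y, width, height, hfov_deg):
    """Map a pixel (x, y) to (azimuth, elevation) in degrees.

    Assumptions (not from the paper): ideal pinhole camera, principal
    point at the image centre, square pixels, no lens distortion.
    Positive azimuth is to the right of the optical axis and positive
    elevation is above it.
    """
    # Focal length in pixels, derived from the horizontal field of view.
    focal_px = (width / 2) / math.tan(math.radians(hfov_deg) / 2)

    # Offsets from the image centre; image y grows downward, so flip it.
    dx = x - width / 2
    dy = height / 2 - y

    azimuth = math.degrees(math.atan2(dx, focal_px))
    elevation = math.degrees(math.atan2(dy, math.hypot(dx, focal_px)))
    return azimuth, elevation


# Hypothetical usage: centre of a detected face box in a 1920x1080 frame
# captured with an assumed 90-degree horizontal field of view.
az, el = pixel_to_angles(1440, 400, 1920, 1080, 90.0)
```

Sign conventions differ between audio renderers and metadata formats, so the azimuth sign may need to be flipped before the angles are written out as object metadata. Distance is not recoverable from a single pixel position and would need a separate estimate.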

Egocentric Deep Multi-Channel Audio-Visual Active Speaker Localization

Audio Inputs for Active Speaker Detection and Localization via Microphone Array

Visually Supervised Speaker Detection and Localization via Microphone Array

Audio-Visual Target Speaker Enhancement on Multi-Talker Environment using Event-Driven Cameras

Leveraging Visual Supervision for Array-based Active Speaker Detection and Localization

Audio-visual voice activity detection using diffusion maps

Leveraging Reverberation and Visual Depth Cues for Sound Event Localization and Detection with Distance Estimation

Data Fusion for Audiovisual Speaker Localization: Extending Dynamic Stream Weights to the Spatial Domain

Is Someone Speaking? Exploring Long-term Temporal Features for Audio-visual Active Speaker Detection

Tragic Talkers: A Shakespearean Sound- and Light-Field Dataset for Audio-Visual Machine Learning Research

Audio-Visual Speaker Tracking: Progress, Challenges, and Future Directions

Self-supervised Learning of Audio Representations from Audio-Visual Data using Spatial Alignment

Class-aware Sounding Objects Localization via Audiovisual Correspondence

Target Active Speaker Detection with Audio-visual Cues

Cross modal video representations for weakly supervised active speaker localization

Bi-Channel Sound Source Localization System for Speaker Detection

Semantic Object Prediction and Spatial Sound Super-Resolution with Binaural Sounds

Telling Left from Right: Learning Spatial Correspondence of Sight and Sound

Aligning Sight and Sound: Advanced Sound Source Localization Through Audio-Visual Alignment

Audio-Visual Speaker Diarization Based on Spatiotemporal Bayesian Fusion