Abstract: Recognizing the sounding objects in scenes is a longstanding objective in embodied AI, with diverse applications in robotics and AR/VR/MR. To that end, Audio-Visual Segmentation (AVS), which conditions on an audio signal to identify the masks of the target sounding objects in an input image captured with synchronized camera and microphone sensors, has recently advanced. However, this paradigm is still insufficient for real-world operation, as the mapping from 2D images to 3D scenes is missing. To address this fundamental limitation, we introduce a novel research problem, 3D Audio-Visual Segmentation, extending the existing AVS to the 3D output space. This problem poses more challenges due to variations in camera extrinsics, audio scattering, occlusions, and diverse acoustics across sounding object categories. To facilitate this research, we create the very first simulation-based benchmark, 3DAVS-S34-O7, providing photorealistic 3D scene environments with grounded spatial audio under single-instance and multi-instance settings, across 34 scenes and 7 object categories. This is made possible by repurposing the Habitat simulator to generate comprehensive annotations of sounding object locations and corresponding 3D masks. Subsequently, we propose a new approach, EchoSegnet, which integrates the ready-to-use knowledge from pretrained 2D audio-visual foundation models with a 3D visual scene representation through spatial audio-aware mask alignment and refinement. Extensive experiments demonstrate that EchoSegnet can effectively segment sounding objects in 3D space on our new benchmark, representing a significant advancement in the field of embodied AI. Project page: https://surrey-uplab.github.io/research/3d-audio-visual-segmentation/

Active Audio-Visual Separation of Dynamic Sound Sources

Sound Adversarial Audio-Visual Navigation

Few-Shot Audio-Visual Learning of Environment Acoustics

3D Audio-Visual Segmentation

ActiveRIR: Active Audio-Visual Exploration for Acoustic Environment Modeling

LAVSS: Location-Guided Audio-Visual Spatial Audio Separation

Visually-Guided Sound Source Separation with Audio-Visual Predictive Coding

Learning to Separate Object Sounds by Watching Unlabeled Video

Audio-Visual Talker Localization in Video for Spatial Sound Reproduction

AVLEN: Audio-Visual-Language Embodied Navigation in 3D Environments

High-Quality Visually-Guided Sound Separation from Diverse Categories

Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos

Continual Audio-Visual Sound Separation

Sim2Real Transfer for Audio-Visual Navigation with Frequency-Adaptive Acoustic Field Prediction

Separating Invisible Sounds Toward Universal Audiovisual Scene-Aware Sound Separation

Into the Wild with AudioScope: Unsupervised Audio-Visual Separation of On-Screen Sounds

Separating the "Chirp" from the "Chat": Self-supervised Visual Grounding of Sound and Language

Egocentric Deep Multi-Channel Audio-Visual Active Speaker Localization

Novel-View Acoustic Synthesis from 3D Reconstructed Rooms

CAVEN: An Embodied Conversational Agent for Efficient Audio-Visual Navigation in Noisy Environments

Audio-Visual Target Speaker Enhancement on Multi-Talker Environment using Event-Driven Cameras