3D Visual Grounding-Audio: 3D scene object detection based on audio

Can Zhang, Zeyu Cai, Xunhao Chen, Feipeng Da, Shaoyan Gai
DOI: https://doi.org/10.1016/j.neucom.2024.128637
IF: 6
2024-09-26
Neurocomputing
Abstract: 3D Visual Grounding (3DVG) is a prevalent multi-modal information fusion task that localizes target objects referenced in natural language descriptions within a point cloud scene. Nevertheless, its stringent demands on input and output devices present substantial hurdles for the application and integration of 3DVG in fields like remote robotic control and telemedicine. To address this challenge, we make three contributions. Firstly, we introduce a novel multi-modal task, termed 3D Visual Grounding-Audio (3DVG-Audio), which is based on the fusion of audio and point cloud. To the best of our knowledge, this is the first Audio-Point Cloud multi-modal task. 3DVG-Audio localizes the objects mentioned in an audio input within the corresponding point cloud. Secondly, building upon ScanRefer, we have developed a novel dataset named 3DVG-AudioSet, specifically designed for the training and evaluation of 3DVG-Audio methods. Finally, we have crafted a tailored loss function to further enhance performance on 3DVG-Audio and introduced a method named AP-Refer, which serves as a benchmark for the task. Extensive experimental results demonstrate the potential for deep integration of audio and point cloud to tackle complex real-world challenges. AP-Refer successfully addresses the 3DVG-Audio task, circumvents the limitations of conventional 3DVG methods, and exhibits significant application potential.
computer science, artificial intelligence