You Only Speak Once to See

Wenhao Yang, Jianguo Wei, Wenhuan Lu, Lei Li
2024-09-30
Abstract: Grounding objects in images using visual cues is a well-established approach in computer vision, yet the potential of audio as a modality for object recognition and grounding remains underexplored. We introduce YOSS, "You Only Speak Once to See," to leverage audio for grounding objects in visual scenes, termed Audio Grounding. By integrating pre-trained audio models with visual models using contrastive learning and multi-modal alignment, our approach captures speech commands or descriptions and maps them directly to corresponding objects within images. Experimental results indicate that audio guidance can be effectively applied to object grounding, suggesting that incorporating audio guidance may enhance the precision and robustness of current object grounding methods and improve the performance of robotic systems and computer vision applications. This finding opens new possibilities for advanced object recognition, scene understanding, and the development of more intuitive and capable robotic systems.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses Audio Grounding: locating objects in visual scenes from spoken instructions. Object grounding has traditionally relied on text and image modalities, and relatively few studies use speech as the input that guides localization. The paper aims to fill this gap with YOSS (You Only Speak Once to See). YOSS combines pre-trained audio models with visual models and, through contrastive learning and multi-modal alignment, maps voice commands or descriptions directly to the corresponding objects in an image. According to the authors, the experiments indicate that audio guidance can be applied effectively to object grounding and may improve the accuracy and robustness of current grounding methods, which in turn could benefit robotic systems and computer vision applications.
The main contributions of the paper are:
1. Proposing an Audio-Image Grounding task for open-vocabulary object detection using audio cues.
2. Developing an Audio-Image Grounding framework that integrates multi-modal information to align images and audio.
3. Demonstrating the effectiveness of this framework through experiments on the COCO, Flickr, and GQA datasets, together with further evaluations.
These contributions suggest that integrating the speech modality into visual tasks has considerable potential and could support more natural and intuitive human-machine interaction, especially in fields such as assistive robots, autonomous systems, and interactive AI agents.
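The summary names contrastive learning and multi-modal alignment but does not give the loss or the architecture. As a rough illustration only, the sketch below assumes a CLIP-style symmetric contrastive (InfoNCE) objective over paired audio and image-region embeddings, followed by grounding via cosine similarity. The function names, the temperature value, and the toy vectors are hypothetical stand-ins for real encoder outputs and are not taken from the paper.

```python
import math


def normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return sum(a * b for a, b in zip(normalize(u), normalize(v)))


def cross_entropy_row(logits, target):
    """Negative log-softmax of logits[target], computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def symmetric_contrastive_loss(audio_embs, region_embs, temperature=0.07):
    """Assumed CLIP-style loss: pair i of audio/region is the positive,
    every other pairing in the batch is a negative."""
    n = len(audio_embs)
    logits = [[cosine(a, r) / temperature for r in region_embs]
              for a in audio_embs]
    # Audio-to-region direction: each row should peak on its diagonal entry.
    a2r = sum(cross_entropy_row(logits[i], i) for i in range(n)) / n
    # Region-to-audio direction: same check over the columns.
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    r2a = sum(cross_entropy_row(cols[j], j) for j in range(n)) / n
    return 0.5 * (a2r + r2a)


def ground(audio_query, region_embs):
    """Return the index of the region most similar to the spoken query."""
    scores = [cosine(audio_query, r) for r in region_embs]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy embeddings standing in for audio-encoder and region-encoder outputs.
audio = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
regions = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.1], [0.0, 0.1, 0.9]]

aligned_loss = symmetric_contrastive_loss(audio, regions)
# Rotating the regions breaks every pairing, so the loss should rise.
mismatched_loss = symmetric_contrastive_loss(audio, regions[1:] + regions[:1])
best_region = ground(audio[1], regions)
```

In this toy setup the correctly paired batch yields a lower loss than the rotated one, and the query for the second utterance selects the second region. A real system would replace the hand-written vectors with outputs of pre-trained audio and vision encoders and would predict bounding boxes rather than pick from a fixed list of regions.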