Abstract: Sounding object localization aims to locate the object that emits a specified sound in a complex scene. The task bridges two perception-oriented modalities, vision and acoustics, and is of considerable research value for the comprehensive perceptual understanding of machine intelligence. Although large amounts of training data have been collected in this field, little of it contains accurate bounding box annotations, which hinders both the learning process and the further application of proposed models. To address this problem, we explore an effective multi-modal knowledge transfer strategy that obtains precise knowledge from similar tasks and transfers it through well-aligned multi-modal data, so that the task can be handled in a zero-resource manner. Concretely, we propose a novel Two-stream Universal Referring localization Network (TURN), which is composed of a localization stream and an alignment stream with distinct functions. The former extracts knowledge related to referring object localization from the image grounding task, while the latter learns a universal semantic space shared between texts and audios. Moreover, we develop an adaptive sampling strategy that automatically identifies the overlap between different data domains, thus boosting the performance and stability of our model. Extensive experiments on various publicly available benchmarks demonstrate that TURN achieves competitive performance compared with state-of-the-art approaches without using any data from this field, which verifies the feasibility of the proposed mechanisms and strategies.
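The zero-resource transfer idea can be illustrated with a toy sketch: if audio and text embeddings live in one shared semantic space, a localizer that was only ever driven by text queries can be queried with audio instead. The code below is not the authors' implementation. The function names, the cosine-similarity scorer, and the random features are all assumptions made purely for illustration.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def localize(query_emb, region_feats):
    """Hypothetical localizer: score each candidate region against the query
    embedding and return the index of the best-matching region plus all scores."""
    q = l2_normalize(query_emb)
    r = l2_normalize(region_feats)
    scores = r @ q
    return int(np.argmax(scores)), scores

rng = np.random.default_rng(0)
dim = 16

# Assumed stand-ins: features for 5 candidate boxes in one image.
region_feats = rng.normal(size=(5, dim))

# A text query describing the object in box 2 (simulated as a noisy copy of its feature).
text_emb = region_feats[2] + 0.05 * rng.normal(size=dim)

# An audio embedding that an alignment model has mapped close to the matching text
# embedding in the shared space (simulated here with a small perturbation).
audio_emb = text_emb + 0.05 * rng.normal(size=dim)

best_text, _ = localize(text_emb, region_feats)
best_audio, _ = localize(audio_emb, region_feats)
print(best_text, best_audio)
```

Because the simulated audio embedding sits close to the text embedding, both queries select the same region, even though the scorer never sees audio-specific supervision. In a real system the quality of this substitution depends entirely on how well the two modalities are aligned.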
