Exploiting Visual Context Semantics for Sound Source Localization.

Xinchi Zhou, Dongzhan Zhou, Di Hu, Hang Zhou, Wanli Ouyang
DOI: https://doi.org/10.1109/wacv56688.2023.00517
2023-01-01
Abstract: Self-supervised sound source localization in unconstrained visual scenes is an important task in audio-visual learning. In this paper, we propose a visual reasoning module that explicitly exploits rich visual context semantics, alleviating the insufficient utilization of visual information in previous works. The learning objectives are carefully designed to provide stronger supervision signals for the extracted visual semantics while enhancing audio-visual interactions, which leads to more robust feature representations. Extensive experimental results demonstrate that our approach significantly boosts localization performance on various datasets, even without initialization from ImageNet-pretrained weights. Moreover, by exploiting visual context, our framework can perform both audio-visual and purely visual inference, which expands the application scope of the sound source localization task and further raises the competitiveness of our approach.