Look, Listen and Infer.

Ruijian Jia,Xinsheng Wang,Shanmin Pang,Jihua Zhu,Jianru Xue
DOI: https://doi.org/10.1145/3394171.3414023
2020-01-01
Abstract:Inspired by the ability of human beings on recognizing the relations between visual scenes and sounds, many cross-modal learning methods have been developed for modeling images or videos and associated sounds. In this work, for the first time, a Look, Listen and Infer Network (LLINet) is proposed to learn a zero-shot model that can infer the relations of visual scenes and sounds from novel categories never appeared before. LLINet is mainly desired to qualify for two tasks, i.e., image-audio cross-modal retrieval and sound localization in each image. Towards this end, it is designed as a two-branch encoding network that builds a common space for images and audios. Besides, a cross-modal attention mechanism is proposed in LLINet to localize sound objects. To evaluate LLINet, a new data set, named INSTRUMENT-32CLASS, is collected in this work. Besides zero-shot cross-modal retrieval and sound localization, a zero-shot image recognition task based on sounds is also conducted on this database. All experimental results on these tasks demonstrate the effectiveness of LLINet, indicating that zero-shot learning for visual scenes and sounds is feasible. The project page for LLINet is available at https://llinet.github.io/.
What problem does this paper attempt to address?