Wnet: Audio-Guided Video Object Segmentation Via Wavelet-Based Cross-Modal Denoising Networks

Wenwen Pan, Haonan Shi, Zhou Zhao, Jieming Zhu, Xiuqiang He, Zhigeng Pan, Lianli Gao, Jun Yu, Fei Wu, Qi Tian
DOI: https://doi.org/10.1109/cvpr52688.2022.00138
2022-01-01
Abstract: Audio-Guided video object segmentation is a challenging problem in visual analysis and editing, which automatically separates foreground objects from the background in a video sequence according to the referring audio expressions. However, existing referring video object segmentation works mainly focus on the guidance of text-based referring expressions, due to the lack of modeling the semantic representations of audio-video interaction contents. In this paper, we consider the problem of audio-guided video semantic segmentation from the viewpoint of end-to-end denoising encoder-decoder network learning. We propose the wavelet-based encoder network to learn the cross-modal representations of the video contents with audio-form queries. Specifically, we adopt the multi-head cross-modal attention layers to explore the potential relations of video and query contents. A 2-dimension discrete wavelet transform is merged into the transformer encoder to decompose the audio-video features. Next, we maximize mutual information between the encoded features and multi-modal features after cross-modal attention layers to enhance the audio guidance. Then, a self attention-free decoder network is developed to generate the target masks with frequency-domain transforms. In addition, we construct the first large-scale audio-guided video semantic segmentation dataset. The extensive experiments show the effectiveness of our method. Code is available at: https://github.com/asudahkzj/Wnet.git.
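To make the decomposition step concrete, the sketch below shows a generic single-level 2D discrete wavelet transform and its inverse on a small 2D array. It is an illustration only, not the authors' implementation: the abstract does not say which wavelet is used, so the Haar wavelet is an assumption, and the function names `haar_dwt2` and `haar_idwt2` are hypothetical. It splits the input into one approximation band and three detail bands, each half the size of the input in both dimensions.

```python
import math

S = math.sqrt(2.0)  # orthonormal Haar scaling factor


def haar_dwt2(x):
    """Single-level 2D Haar DWT of a list of lists with even height and width.

    Assumption: Haar wavelet (the abstract does not name one).
    Returns (approx, detail1, detail2, detail3), each of size (h/2, w/2).
    """
    h, w = len(x), len(x[0])
    if h % 2 or w % 2:
        raise ValueError("height and width must both be even")

    # Step 1: filter along rows (pairwise sums and differences).
    lo = [[(r[j] + r[j + 1]) / S for j in range(0, w, 2)] for r in x]
    hi = [[(r[j] - r[j + 1]) / S for j in range(0, w, 2)] for r in x]

    # Step 2: filter along columns; sign=+1 averages, sign=-1 differences.
    def cols(m, sign):
        return [[(m[i][j] + sign * m[i + 1][j]) / S for j in range(w // 2)]
                for i in range(0, h, 2)]

    return cols(lo, 1), cols(lo, -1), cols(hi, 1), cols(hi, -1)


def haar_idwt2(a, d1, d2, d3):
    """Inverse of haar_dwt2: rebuild the original array from its four bands."""
    h2, w2 = len(a), len(a[0])
    lo, hi = [], []
    # Undo the column filtering: each band row yields two output rows.
    for i in range(h2):
        lo.append([(a[i][j] + d1[i][j]) / S for j in range(w2)])
        lo.append([(a[i][j] - d1[i][j]) / S for j in range(w2)])
        hi.append([(d2[i][j] + d3[i][j]) / S for j in range(w2)])
        hi.append([(d2[i][j] - d3[i][j]) / S for j in range(w2)])
    # Undo the row filtering: each coefficient pair yields two output columns.
    out = []
    for lo_row, hi_row in zip(lo, hi):
        row = []
        for l, g in zip(lo_row, hi_row):
            row += [(l + g) / S, (l - g) / S]
        out.append(row)
    return out
```

Because the transform is orthonormal, applying `haar_idwt2` to the output of `haar_dwt2` recovers the input up to floating-point error. In practice a wavelet library would normally be used rather than hand-written loops.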