Towards Language-guided Visual Recognition Via Dynamic Convolutions

Gen Luo, Yiyi Zhou, Xiaoshuai Sun, Yongjian Wu, Yue Gao, Rongrong Ji
DOI: https://doi.org/10.1007/s11263-023-01871-1
IF: 13.369
2024-01-01
International Journal of Computer Vision
Abstract: In this paper, we aim to establish a unified, end-to-end multi-modal network by exploring language-guided visual recognition. To this end, we first propose a novel multi-modal convolution module called Language-guided Dynamic Convolution (LaConv). Its convolution kernels are dynamically generated from natural language information, which helps extract differentiated visual features for different multi-modal examples. Based on the LaConv module, we further build a fully language-driven convolution network, termed LaConvNet, which unifies visual recognition and multi-modal reasoning in one forward structure. To validate LaConv and LaConvNet, we conduct extensive experiments on seven benchmark datasets across three vision-and-language tasks, i.e., visual question answering, referring expression comprehension, and referring expression segmentation. The experimental results not only show that LaConvNet performs competitively with or better than existing multi-modal networks, but also demonstrate its merits as a unified structure, including a compact network, low computational cost, and high generalization ability. Our source code is released in the SimREC project: https://github.com/luogen1996/LaConvNet.
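The core idea in the abstract, convolution kernels generated from language rather than stored as fixed weights, can be illustrated with a minimal sketch. This is not the authors' LaConv implementation: the linear kernel generator, the single-channel 3x3 convolution, and all names (`generate_kernel`, `conv2d_same`, `language_guided_conv`, `weight`) are hypothetical and chosen only to show how the same image can yield different features under different text inputs.

```python
import random


def generate_kernel(text_vec, weight, k=3):
    """Project a sentence embedding to a k x k kernel.

    Assumption: a plain linear projection stands in for the kernel generator.
    `weight` has k*k rows, each as long as `text_vec`.
    """
    flat = [sum(w * t for w, t in zip(row, text_vec)) for row in weight]
    return [flat[i * k:(i + 1) * k] for i in range(k)]


def conv2d_same(image, kernel):
    """Single-channel 2D cross-correlation with zero padding (same-size output)."""
    h, w = len(image), len(image[0])
    k = len(kernel)
    pad = k // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for i in range(k):
                for j in range(k):
                    yy, xx = y + i - pad, x + j - pad
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += image[yy][xx] * kernel[i][j]
            out[y][x] = acc
    return out


def language_guided_conv(image, text_vec, weight, k=3):
    """Convolve the image with a kernel produced from the text embedding."""
    return conv2d_same(image, generate_kernel(text_vec, weight, k))


# Toy data: a 5x5 "image" and two 4-dimensional "sentence embeddings".
rng = random.Random(0)
weight = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(9)]
image = [[rng.random() for _ in range(5)] for _ in range(5)]
text_a = [1.0, 0.0, 0.0, 0.0]
text_b = [0.0, 1.0, 0.0, 0.0]

feat_a = language_guided_conv(image, text_a, weight)
feat_b = language_guided_conv(image, text_b, weight)
```

Here `feat_a` and `feat_b` come from the same image but differ because each text embedding selects a different kernel, which is the sense in which the visual features are "differentiated" per example.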
What problem does this paper attempt to address?