Improving visual grounding with multi-scale discrepancy information and centralized-transformer

Jie Wu, Chunlei Wu, Fuyan Wang, Leiquan Wang, Yiwei Wei
DOI: https://doi.org/10.1016/j.eswa.2024.123223
IF: 8.5
2024-01-31
Expert Systems with Applications
Abstract: Visual grounding associates linguistic expressions with the corresponding objects or regions in an image. Current methods extract multi-scale features from the image and establish cross-modal relationships through transformers. However, directly combining multi-scale features often introduces redundant information, which weakens the complementarity between scales. Furthermore, using transformers to obtain compact multi-modal representations may overlook essential corner regions. In this paper, we propose a centralized-transformer network with multi-scale discrepancy information (CTMDI), which explores multi-scale difference features and performs centralized cross-modal reasoning for precise visual grounding. The multi-scale discrepancy information module calculates the variations of features across scales to capture fine-grained details while maintaining an overall understanding of the visual content. To enhance cross-modal interactions, a centralized transformer is proposed that simultaneously aggregates the local essential information and the global distance correlations of the multi-modal fusion features. Comprehensive experiments on three typical datasets demonstrate the superiority of CTMDI over existing approaches.
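To illustrate what "variations of features at different scales" could mean in practice, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that feature maps are ordered from fine to coarse, that each resolution is an integer multiple of the next, and that the discrepancy is a plain element-wise difference after nearest-neighbour upsampling. The function names `upsample_nearest` and `scale_discrepancies` are hypothetical, and the abstract does not specify how CTMDI actually computes or fuses these differences.

```python
import numpy as np


def upsample_nearest(feat, factor):
    """Nearest-neighbour upsampling of a (C, H, W) feature map by an integer factor."""
    return feat.repeat(factor, axis=1).repeat(factor, axis=2)


def scale_discrepancies(features):
    """Difference maps between adjacent scales (illustrative assumption only).

    features: list of (C, H_i, W_i) arrays ordered fine to coarse, where each
    spatial size is an integer multiple of the next one.
    Returns one difference map per adjacent pair, at the finer resolution.
    """
    diffs = []
    for fine, coarse in zip(features[:-1], features[1:]):
        factor = fine.shape[1] // coarse.shape[1]
        # Bring the coarse map up to the fine resolution, then subtract.
        diffs.append(fine - upsample_nearest(coarse, factor))
    return diffs


# Toy three-level feature pyramid with random values.
rng = np.random.default_rng(0)
feats = [rng.standard_normal((8, s, s)) for s in (32, 16, 8)]
diffs = scale_discrepancies(feats)
print([d.shape for d in diffs])
```

In this sketch, content shared by two adjacent scales cancels in the subtraction, so each difference map keeps only what the finer scale adds. That is one plausible reading of how discrepancy features could reduce the redundancy the abstract attributes to direct multi-scale combination.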
computer science, artificial intelligence; engineering, electrical & electronic; operations research & management science
What problem does this paper attempt to address?