Linguistic-Aware Patch Slimming Framework for Fine-grained Cross-Modal Alignment

Zheren Fu,Lei Zhang,Hou Xia,Zhendong Mao
DOI: https://doi.org/10.1109/cvpr52733.2024.02485
2024-01-01
Abstract:Cross-modal alignment aims to build a bridge connecting vision and language. It is an important multi-modal task that efficiently learns the semantic similarities be-tween images and texts. Traditional finegrained alignment methods heavily rely on pre-trained object detectors to extract region features for subsequent region-word alignment, thereby incurring substantial computational costs for region detection and error propagation issues for two-stage training. In this paper, we focus on the mainstream vision transformer, incorporating patch features for patch-word alignment, while addressing the resultant issue of visual patch redundancy and patch ambiguity for semantic alignment. We propose a novel Linguistic-Aware Patch Slim-ming (LAPS) framework for fine-grained alignment, which explicitly identifies redundant visual patches with language supervision and rectifies their semantic and spatial information to facilitate more effective and consistent patchword alignment. Extensive experiments on various evaluation benchmarks and model backbones show LAPS out-performs the state-of-the-art fine-grained alignment methods by 5%-15% rSum. Our code is available at https://github.com/CrossmodalGroup/LAPS.
What problem does this paper attempt to address?