Unifying Two-Stream Encoders with Transformers for Cross-Modal Retrieval

Yi Bin, Haoxuan Li, Yahui Xu, Xing Xu, Yang Yang, Heng Tao Shen
DOI: https://doi.org/10.1145/3581783.3612427
2023-08-08
Abstract: Most existing cross-modal retrieval methods employ two-stream encoders with different architectures for images and texts, e.g., CNN for images and RNN/Transformer for texts. Such discrepancy in architectures may induce different semantic distribution spaces and limit the interactions between images and texts, and further result in inferior alignment between images and texts. To fill this research gap, inspired by recent advances of Transformers in vision tasks, we propose to unify the encoder architectures with Transformers for both modalities. Specifically, we design a cross-modal retrieval framework purely based on two-stream Transformers, dubbed Hierarchical Alignment Transformers (HAT), which consists of an image Transformer, a text Transformer, and a hierarchical alignment module. With such identical architectures, the encoders could produce representations with more similar characteristics for images and texts, and make the interactions and alignments between them much easier. Besides, to leverage the rich semantics, we devise a hierarchical alignment scheme to explore multi-level correspondences of different layers between images and texts. To evaluate the effectiveness of the proposed HAT, we conduct extensive experiments on two benchmark datasets, MSCOCO and Flickr30K. Experimental results demonstrate that HAT outperforms SOTA baselines by a large margin. Specifically, on two key tasks, i.e., image-to-text and text-to-image retrieval, HAT achieves 7.6% and 16.7% relative score improvement of Recall@1 on MSCOCO, and 4.4% and 11.6% on Flickr30K respectively. The code is available at https://github.com/LuminosityX/HAT.
Computer Vision and Pattern Recognition, Information Retrieval, Multimedia
What problem does this paper attempt to address?
### Problems Addressed by the Paper

This paper primarily addresses the mismatch between image and text encoder architectures in cross-modal retrieval. Specifically:

1. **Inconsistent Encoder Architectures**: Existing cross-modal retrieval methods typically use two-stream encoders with different architectures for images and texts (e.g., CNNs for images and RNNs or Transformers for texts). This architectural discrepancy can induce different semantic distribution spaces, which limits the interaction between images and texts and results in inferior alignment between them.
2. **Proposing a Unified Architecture**: To fill this research gap, the authors, inspired by recent advances of Transformers in vision tasks, propose to unify the encoder architectures with Transformers for both modalities. They design a cross-modal retrieval framework based purely on two-stream Transformers, called Hierarchical Alignment Transformers (HAT), which consists of an image Transformer, a text Transformer, and a hierarchical alignment module.
3. **Improving Alignment Performance**: With identical architectures, the encoders can produce image and text representations with more similar characteristics, which makes interaction and alignment easier. In addition, to leverage the rich semantics, the authors devise a hierarchical alignment scheme that explores multi-level correspondences between different layers of the image and text encoders.
4. **Experimental Validation**: Extensive experiments on two benchmark datasets, MSCOCO and Flickr30K, show that HAT outperforms SOTA baselines on image-to-text and text-to-image retrieval. On MSCOCO, HAT achieves relative Recall@1 improvements of 7.6% and 16.7% on the two tasks, respectively; on Flickr30K, the relative improvements are 4.4% and 11.6%, respectively.

In summary, this paper aims to improve cross-modal retrieval by unifying the two encoder architectures and aligning their representations at multiple levels.
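
The results above are stated as relative improvements in Recall@1, so it may help to see how those two quantities are usually computed. The following is a minimal sketch in plain Python and is not taken from the HAT code base: the function names `recall_at_k` and `relative_improvement`, the toy similarity matrix, and the example scores are all assumptions made for illustration. It assumes each query comes with a set of indices of its ground-truth matches, and that a query counts as a hit when any of them appears among the top-k candidates.

```python
from typing import Sequence, Set


def recall_at_k(similarity: Sequence[Sequence[float]],
                relevant: Sequence[Set[int]],
                k: int = 1) -> float:
    """Percentage of queries with at least one relevant item in the top-k."""
    hits = 0
    for scores, gold in zip(similarity, relevant):
        # Rank candidate indices by descending similarity score.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if gold & set(ranked[:k]):
            hits += 1
    return 100.0 * hits / len(similarity)


def relative_improvement(new: float, baseline: float) -> float:
    """Relative gain of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline


# Toy data, made up for illustration: 3 queries scored against 4 candidates.
toy_similarity = [
    [0.9, 0.1, 0.3, 0.2],  # best candidate: 0
    [0.2, 0.4, 0.8, 0.1],  # best candidate: 2
    [0.5, 0.6, 0.1, 0.3],  # best candidate: 1
]
# Hypothetical ground truth: indices of the matching candidates per query.
toy_relevant = [{0}, {1}, {1, 3}]

r1 = recall_at_k(toy_similarity, toy_relevant, k=1)
r2 = recall_at_k(toy_similarity, toy_relevant, k=2)
print(f"R@1 = {r1:.1f}, R@2 = {r2:.1f}")

# Hypothetical scores, not results from the paper.
print(f"relative gain = {relative_improvement(55.0, 50.0):.1f}%")
```

Note that a relative improvement is measured against the baseline score, so it is not the same as the absolute difference in Recall@1 points: in the hypothetical example, moving from 50.0 to 55.0 is a gain of 5 points but a relative improvement of 10%.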