Unifying Two-Stream Encoders with Transformers for Cross-Modal Retrieval

Yi Bin, Haoxuan Li, Yahui Xu, Xing Xu, Yang Yang, Heng Tao Shen

DOI: https://doi.org/10.1145/3581783.3612427

2023-08-08

Abstract: Most existing cross-modal retrieval methods employ two-stream encoders with different architectures for images and texts, e.g., CNN for images and RNN/Transformer for texts. Such discrepancy in architectures may induce different semantic distribution spaces and limit the interactions between images and texts, and further result in inferior alignment between images and texts. To fill this research gap, inspired by recent advances of Transformers in vision tasks, we propose to unify the encoder architectures with Transformers for both modalities. Specifically, we design a cross-modal retrieval framework purely based on two-stream Transformers, dubbed Hierarchical Alignment Transformers (HAT), which consists of an image Transformer, a text Transformer, and a hierarchical alignment module. With such identical architectures, the encoders could produce representations with more similar characteristics for images and texts, and make the interactions and alignments between them much easier. Besides, to leverage the rich semantics, we devise a hierarchical alignment scheme to explore multi-level correspondences of different layers between images and texts. To evaluate the effectiveness of the proposed HAT, we conduct extensive experiments on two benchmark datasets, MSCOCO and Flickr30K. Experimental results demonstrate that HAT outperforms SOTA baselines by a large margin. Specifically, on two key tasks, i.e., image-to-text and text-to-image retrieval, HAT achieves 7.6% and 16.7% relative score improvement of Recall@1 on MSCOCO, and 4.4% and 11.6% on Flickr30K respectively. The code is available at https://github.com/LuminosityX/HAT.
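To make the idea of multi-level correspondences more concrete, here is a minimal sketch in plain Python. It assumes that each encoder exposes one pooled feature vector per selected layer and that the per-level cosine similarities are simply averaged. The function names `cosine` and `hierarchical_similarity`, the pooling, and the averaging rule are illustrative assumptions, not the aggregation actually used in HAT, which the abstract does not specify.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors; returns 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def hierarchical_similarity(
    image_levels: Sequence[Sequence[float]],
    text_levels: Sequence[Sequence[float]],
) -> float:
    """Average the image-text cosine similarity over matching levels.

    Assumption: level i of the image encoder is aligned with level i of the
    text encoder, and all levels are weighted equally.
    """
    if len(image_levels) != len(text_levels) or not image_levels:
        raise ValueError("both modalities need the same, non-zero number of levels")
    per_level = [cosine(i, t) for i, t in zip(image_levels, text_levels)]
    return sum(per_level) / len(per_level)


# Toy example with two levels of 2-d features (hypothetical values).
image_levels = [[1.0, 0.0], [0.0, 2.0]]
text_levels = [[1.0, 0.0], [3.0, 0.0]]
score = hierarchical_similarity(image_levels, text_levels)
print(score)
```

In this toy case the first level matches perfectly and the second is orthogonal, so the combined score sits halfway between the two. A learned weighting or a token-level alignment could replace the plain average.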

Computer Vision and Pattern Recognition, Information Retrieval, Multimedia

What problem does this paper attempt to address?

### Problems Addressed by the Paper

This paper primarily addresses the issue of inconsistent encoder architectures in cross-modal retrieval. Specifically:

1. **Inconsistent Encoder Architectures**: Existing cross-modal retrieval methods typically use two-stream encoders with different architectures for images and texts (e.g., CNNs for images and RNNs or Transformers for texts). This architectural discrepancy can induce different semantic distribution spaces, limit the interaction between images and texts, and result in poor alignment between them.
2. **Proposing a Unified Architecture**: To fill this research gap, the authors, inspired by the progress of Transformers in vision tasks, propose unifying the encoder architectures with Transformers for both modalities. They design a cross-modal retrieval framework based purely on two-stream Transformers, called Hierarchical Alignment Transformers (HAT), which consists of an image Transformer, a text Transformer, and a hierarchical alignment module.
3. **Improving Alignment Performance**: With identical architectures, the encoders can produce image and text representations with more similar characteristics, making interaction and alignment easier. Additionally, to leverage rich semantic information, the authors design a hierarchical alignment scheme that explores multi-level correspondences between different layers of the image and text encoders.
4. **Experimental Validation**: Extensive experiments on two benchmark datasets, MSCOCO and Flickr30K, show that HAT outperforms state-of-the-art baselines on image-to-text and text-to-image retrieval. On MSCOCO, HAT achieves relative Recall@1 improvements of 7.6% and 16.7% on the two tasks, respectively; on Flickr30K, the relative improvements are 4.4% and 11.6%.

In summary, this paper aims to improve cross-modal retrieval through a unified encoder architecture.
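The Recall@1 figures above are relative improvements, so it may help to see how Recall@K and a relative gain are typically computed. The sketch below is plain Python under stated assumptions: `sim` is a hypothetical query-by-candidate similarity matrix, `relevant` lists the ground-truth candidate indices per query, and the function names are illustrative rather than taken from the HAT code base. The toy numbers are made up and are not results from the paper.

```python
from typing import Sequence, Set


def recall_at_k(
    sim: Sequence[Sequence[float]],
    relevant: Sequence[Set[int]],
    k: int = 1,
) -> float:
    """Fraction of queries with at least one relevant candidate in the top k."""
    hits = 0
    for scores, gold in zip(sim, relevant):
        # Rank candidate indices by descending similarity to the query.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if gold & set(ranked[:k]):
            hits += 1
    return hits / len(sim)


def relative_improvement(new: float, baseline: float) -> float:
    """Relative gain of `new` over `baseline`, in percent."""
    return (new - baseline) * 100.0 / baseline


# Toy data: 3 queries, 3 candidates each (hypothetical values).
sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.5, 0.6, 0.1],
]
relevant = [{0}, {1}, {1}]

r1 = recall_at_k(sim, relevant, k=1)
r2 = recall_at_k(sim, relevant, k=2)
gain = relative_improvement(55.0, 50.0)
print(r1, r2, gain)
```

A relative improvement divides the gain by the baseline score, so the same absolute gain looks larger when the baseline is lower. This is one reason the text-to-image percentages can exceed the image-to-text ones even if the absolute gains are similar.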

Select & Re-Rank: Effectively and Efficiently Matching Multimodal Data with Dynamically Evolving Attention

Fine-Grained Cross-Modal Retrieval with Triple-Streamed Memory Fusion Transformer Encoder

Embedded Heterogeneous Attention Transformer for Cross-lingual Image Captioning

HiT: Hierarchical Transformer with Momentum Contrast for Video-Text Retrieval

Feature Fusion Based on Transformer for Cross-modal Retrieval

Audio-text Retrieval with Transformer-based Hierarchical Alignment and Disentangled Cross-modal Representation

Vision Transformers with Hierarchical Attention

Multi-task hierarchical convolutional network for visual-semantic cross-modal retrieval

HAAN: Learning a Hierarchical Adaptive Alignment Network for Image-Text Retrieval

HAT: Hierarchical Aggregation Transformers for Person Re-identification

Hugs Bring Double Benefits: Unsupervised Cross-Modal Hashing with Multi-granularity Aligned Transformers

Fine-Grained Visual Textual Alignment for Cross-Modal Retrieval Using Transformer Encoders

COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval

Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations

Semantic-alignment transformer and adversary hashing for cross-modal retrieval

Attribute Surrogates Learning and Spectral Tokens Pooling in Transformers for Few-shot Learning

Meta-Transformer: A Unified Framework for Multimodal Learning

CAT: Cross Attention in Vision Transformer

Improving Vision Transformers by Overlapping Heads in Multi-Head Self-Attention

Hierarchical Feature Aggregation based on Transformer for Image-text Matching