Abstract: Most existing cross-modal retrieval methods employ two-stream encoders with different architectures for images and texts, \textit{e.g.}, a CNN for images and an RNN/Transformer for texts. Such architectural discrepancy may induce different semantic distribution spaces, limit the interactions between images and texts, and consequently result in inferior alignment between the two modalities. To fill this research gap, inspired by recent advances of Transformers in vision tasks, we propose to unify the encoder architectures with Transformers for both modalities. Specifically, we design a cross-modal retrieval framework based purely on two-stream Transformers, dubbed \textbf{Hierarchical Alignment Transformers (HAT)}, which consists of an image Transformer, a text Transformer, and a hierarchical alignment module. With such identical architectures, the encoders can produce representations with more similar characteristics for images and texts, making the interactions and alignments between them much easier. In addition, to leverage the rich semantics, we devise a hierarchical alignment scheme that explores multi-level correspondences between images and texts across different layers. To evaluate the effectiveness of the proposed HAT, we conduct extensive experiments on two benchmark datasets, MSCOCO and Flickr30K. Experimental results demonstrate that HAT outperforms SOTA baselines by a large margin. Specifically, on two key tasks, \textit{i.e.}, image-to-text and text-to-image retrieval, HAT achieves 7.6\% and 16.7\% relative improvement in Recall@1 on MSCOCO, and 4.4\% and 11.6\% on Flickr30K, respectively. The code is available at \url{https://github.com/LuminosityX/HAT}.
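
To make the reported metric concrete, the following is a minimal sketch of how Recall@K and a relative improvement over a baseline are commonly computed for retrieval. It is not taken from the HAT code base: the function names `recall_at_k` and `relative_improvement`, the toy similarity matrix, and the ground-truth sets are all illustrative assumptions.

```python
import numpy as np

def recall_at_k(sim, gt, k=1):
    """Percentage of queries with at least one relevant item in the top-k.

    sim: (n_queries, n_items) similarity matrix (hypothetical scores).
    gt:  list of sets; gt[i] holds the relevant item indices for query i.
    """
    # Rank items for each query by descending similarity and keep the top k.
    topk = np.argsort(-sim, axis=1)[:, :k]
    hits = [len(set(row.tolist()) & rel) > 0 for row, rel in zip(topk, gt)]
    return 100.0 * float(np.mean(hits))

def relative_improvement(new_score, base_score):
    """Relative gain of new_score over base_score, in percent."""
    return 100.0 * (new_score - base_score) / base_score

# Toy example (assumed data): 3 queries, 3 candidate items.
sim = np.array([[0.9, 0.1, 0.2],
                [0.3, 0.2, 0.8],
                [0.1, 0.7, 0.4]])
gt = [{0}, {1}, {1}]

r1 = recall_at_k(sim, gt, k=1)  # queries 0 and 2 are hits, query 1 is a miss
gain = relative_improvement(55.0, 50.0)
```

A relative improvement is measured against the baseline score rather than as an absolute difference, so in this toy example moving from 50.0 to 55.0 in Recall@1 is a 10\% relative gain.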

Audio-text Retrieval with Transformer-based Hierarchical Alignment and Disentangled Cross-modal Representation

Improving Audio-Text Retrieval via Hierarchical Cross-Modal Interaction and Auxiliary Captions

Multiscale Matching Driven by Cross-Modal Similarity Consistency for Audio-Text Retrieval

Unifying Two-Stream Encoders with Transformers for Cross-Modal Retrieval

CTAL: Pre-training Cross-modal Transformer for Audio-and-Language Representations

DiffATR: Diffusion-based Generative Modeling for Audio-Text Retrieval

Audio-aware Query-enhanced Transformer for Audio-Visual Segmentation

CATR: Combinatorial-Dependence Audio-Queried Transformer for Audio-Visual Video Segmentation

Multi-grained Representation Learning for Cross-modal Retrieval

GPA: Global and Prototype Alignment for Audio-Text Retrieval

Multi-Granularity Aggregation Transformer for Joint Video-Audio-Text Representation Learning

TransAVS: End-To-End Audio-Visual Segmentation With Transformer

Audio-Text Retrieval in Context

Efficient Selective Audio Masked Multimodal Bottleneck Transformer for Audio-Video Classification

Mutual Alignment between Audiovisual Features for End-to-End Audiovisual Speech Recognition

MA-AVT: Modality Alignment for Parameter-Efficient Audio-Visual Transformers

Almost Unsupervised Text to Speech and Automatic Speech Recognition

EDTC: Enhance Depth of Text Comprehension in Automated Audio Captioning

Context-Aware Hierarchical Transformer for Fine-Grained Video-Text Retrieval

Multimodal Sparse Transformer Network for Audio-Visual Speech Recognition