Abstract: Existing two-stream models for image-text matching perform well while keeping retrieval fast, and have received extensive attention from industry and academia. These methods encode the image and the text separately, each as a single representation, and compute a matching score as the cosine similarity or inner product of the two vectors. However, the performance of two-stream models is often sub-optimal. On the one hand, a single representation struggles to cover complex content comprehensively. On the other hand, because the framework lacks interaction between the two modalities, it is hard to match multiple meanings, so some information is ignored. To address these problems and improve the performance of two-stream models, we propose MVAM (Multi-View Attention Model), a multi-view attention approach for two-stream image-text matching. It first learns multiple image and text representations using diverse attention heads with different view codes, and then concatenates these representations into one for matching. A diversity objective is also used to promote diversity between attention heads. With this method, models can encode images and text from different views and attend to more key points, yielding representations that contain more information. In retrieval, the matching scores between images and texts can then be calculated from different aspects, leading to better matching performance. Experimental results on MSCOCO and Flickr30K show that the proposed model brings improvements over existing models. Further case studies show that different attention heads can focus on different content and together produce a more comprehensive representation.
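
The two-stream scoring described in the abstract can be illustrated with a minimal sketch. It assumes the image and text embeddings have already been produced by two separate encoders; the random arrays below are placeholders, and `l2_normalize` and `cosine_scores` are hypothetical helper names, not code from the paper.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis` (eps guards against zero vectors)."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def cosine_scores(image_embs, text_embs):
    """Return an (n_images, n_texts) matrix of cosine similarities."""
    return l2_normalize(image_embs) @ l2_normalize(text_embs).T

# Placeholder embeddings standing in for the outputs of the two encoders.
rng = np.random.default_rng(0)
image_embs = rng.normal(size=(4, 8))   # 4 images, 8-dim embeddings
text_embs = rng.normal(size=(3, 8))    # 3 texts, 8-dim embeddings

scores = cosine_scores(image_embs, text_embs)
best_image_per_text = scores.argmax(axis=0)  # text-to-image retrieval
```

Because each side is encoded independently, image embeddings can be computed once and stored, so scoring a new query reduces to a single matrix product. This is the retrieval-speed advantage the abstract attributes to two-stream models.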

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper aims to address the suboptimal performance of existing two-stream models in image-text matching tasks. Specifically:

1. **Limitations of Single Representation**: Existing two-stream models typically use a single representation to encode each image and text, which makes it difficult to cover the content comprehensively, especially when the input text or image is complex.
2. **Lack of Interaction Leading to Overlooked Information**: Because images and texts do not interact in existing frameworks, matching content with multiple meanings is difficult, and some information is overlooked.

To overcome these issues, the authors propose a Multi-View Attention Model (MVAM), which learns multiple image and text representations through different attention heads and concatenates them into a final representation. A diversity objective is also introduced to promote diversity among the attention heads, so that images and texts are encoded from different perspectives, more key points are attended to, and matching performance improves.

### Key Points of the Solution

1. **Multi-View Representation**: Multiple image and text representations are learned through different attention heads, with each head focusing on different content.
2. **Diversity Objective Function**: A diversity loss encourages the attention heads to differ from one another, avoiding redundancy.
3. **Concatenated Representation**: The representations of the multiple views are concatenated into a final representation for the matching task.

### Experimental Results

Experimental results show that MVAM improves the performance of existing models on the MSCOCO and Flickr30K datasets. The gains are most visible on complex queries: for long and complex text queries, MVAM retrieves images that better meet the requirements.

### Case Analysis

The authors demonstrate the advantages of MVAM through specific cases. For instance, for the query "a kitchen with 2 windows and 2 metal sinks," the CLIP model overlooks the detail "2 windows," while the MVAM-CLIP model retrieves an image that fully meets the query.

### Conclusion

MVAM improves the performance of two-stream models in image-text matching tasks through its multi-view attention mechanism and diversity loss, and is especially effective on complex queries.
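
The three key points of the solution (view-specific attention heads, a diversity objective, and concatenation) can be sketched in NumPy as follows. This is a sketch under stated assumptions, not the authors' implementation. Each view code is assumed to act as a query vector for scaled dot-product attention pooling. Each modality gets its own set of view codes. The diversity term is assumed to be the mean squared overlap between the heads' attention distributions; the summary does not give the paper's exact objective. The names `multi_view_pool` and `diversity_penalty` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_view_pool(features, view_codes):
    """Pool token/region features once per view and concatenate the results.

    features:   (n_tokens, d) local features from one encoder
    view_codes: (n_views, d) one query vector per attention head (assumed form)
    Returns the concatenated vector (n_views * d,) and the attention
    weights (n_views, n_tokens).
    """
    d = features.shape[-1]
    attn = softmax(view_codes @ features.T / np.sqrt(d), axis=-1)
    views = attn @ features            # (n_views, d): one pooled vector per head
    return views.reshape(-1), attn

def diversity_penalty(attn):
    """Assumed diversity term: mean squared off-diagonal overlap between heads.

    It is zero when heads attend to disjoint tokens and grows as heads
    attend to the same tokens. Requires at least two heads.
    """
    gram = attn @ attn.T
    off_diag = gram - np.diag(np.diag(gram))
    n = attn.shape[0]
    return (off_diag ** 2).sum() / (n * (n - 1))

# Placeholder features standing in for encoder outputs.
rng = np.random.default_rng(0)
n_views, d = 3, 8
image_view_codes = rng.normal(size=(n_views, d))
text_view_codes = rng.normal(size=(n_views, d))
image_regions = rng.normal(size=(5, d))
text_tokens = rng.normal(size=(7, d))

img_vec, img_attn = multi_view_pool(image_regions, image_view_codes)
txt_vec, txt_attn = multi_view_pool(text_tokens, text_view_codes)

# Matching score on the concatenated representations (cosine similarity).
score = img_vec @ txt_vec / (np.linalg.norm(img_vec) * np.linalg.norm(txt_vec))
penalty = diversity_penalty(img_attn) + diversity_penalty(txt_attn)
```

The inner product of two concatenated vectors equals the sum of the per-view inner products, so the final score aggregates one comparison per view while still needing only a single vector per image or text at retrieval time.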

Image-Text Matching with Multi-View Attention

Select & Re-Rank: Effectively and Efficiently Matching Multimodal Data with Dynamically Evolving Attention

Multiview adaptive attention pooling for image-text retrieval

Cross-Modal Attention With Semantic Consistence for Image–Text Matching

Dual Semantic Relationship Attention Network for Image-Text Matching

Bridging the gap: dual perception attention and local-global similarity fusion for cross-modal image-text matching

Reference-Aware Adaptive Network for Image-Text Matching

Multi-Modality Cross Attention Network for Image and Sentence Matching

Enhancing Separate Encoding with Multi-layer Feature Alignment for Image-Text Matching

Advanced Multimodal Deep Learning Architecture for Image-Text Matching

Giving Text More Imagination Space for Image-text Matching

Multi-view and region reasoning semantic enhancement for image-text retrieval

Scene-text aware cross-modal retrieval based on semantic matching (ChinaMM2024)

Image–Text Matching Model Based on CLIP Bimodal Encoding

Efficient Token-Guided Image-Text Retrieval With Consistent Multimodal Contrastive Training

An End-to-End Image-Text Matching Approach Considering Semantic Uncertainty

Multi-level network based on transformer encoder for fine-grained image–text matching

Two-Stream Video Classification with Cross-Modality Attention

Short text matching model with multiway semantic interaction based on multi-granularity semantic embedding

Image-text matching using multi-subspace joint representation

Focus Your Attention: A Bidirectional Focal Attention Network for Image-Text Matching