Image-Text Matching with Multi-View Attention

Rui Cheng, Wanqing Cui
2024-02-27
Abstract: Existing two-stream models for image-text matching perform well while keeping retrieval fast, and they have received extensive attention from industry and academia. These methods encode the image and the text separately, each as a single representation, and compute a matching score as the cosine similarity or inner product of the two vectors. However, the performance of two-stream models is often sub-optimal. On the one hand, a single representation struggles to cover complex content comprehensively. On the other hand, because this framework lacks interaction between the image and the text, it is hard to match inputs with multiple meanings, which leads to information being ignored. To address these problems and improve the performance of two-stream models, we propose MVAM (Multi-View Attention Model), a multi-view attention approach for two-stream image-text matching. It first learns multiple image and text representations using diverse attention heads with different view codes, and then concatenates these representations into one for matching. A diversity objective is also used to promote diversity between attention heads. With this method, models can encode images and text from different views and attend to more key points, yielding representations that contain more information. In retrieval tasks, the matching scores between images and texts can then be calculated from different aspects, leading to better matching performance. Experimental results on MSCOCO and Flickr30K show that the proposed model brings improvements over existing models. Further case studies show that different attention heads can focus on different contents and together produce a more comprehensive representation.
Computer Vision and Pattern Recognition, Computation and Language
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper addresses the suboptimal performance of existing two-stream models on image-text matching tasks. Specifically:

1. **Limitations of a Single Representation**: Existing two-stream models typically encode each image and each text as a single representation, which struggles to cover complex content comprehensively, especially when the input text or image is complex.
2. **Lack of Interaction Causes Information to Be Overlooked**: Because the image and text encoders do not interact, inputs with multiple meanings are hard to match, and some information is overlooked.

To overcome these issues, the authors propose the Multi-View Attention Model (MVAM), which learns multiple image and text representations through attention heads with different view codes and concatenates them into a final representation. A diversity objective function promotes diversity among the attention heads, so that images and texts are encoded from different perspectives and more key points are attended to, which improves matching performance.

### Key Points of the Solution

1. **Multi-View Representation**: Multiple image and text representations are learned through different attention heads, with each head focusing on different content.
2. **Diversity Objective Function**: A diversity loss encourages the attention heads to differ from one another, avoiding redundancy.
3. **Concatenated Representation**: The representations from the multiple views are concatenated into a final representation used for matching.

### Experimental Results

Experimental results show that MVAM improves on existing models on the MSCOCO and Flickr30K datasets. In particular, MVAM matches images and texts more accurately on complex queries: for long, detailed text queries, it retrieves images that better satisfy the description.
### Case Analysis

The authors illustrate the advantages of MVAM with specific cases. For instance, for the query "a kitchen with 2 windows and 2 metal sinks," the CLIP model overlooks the detail "2 windows," while the MVAM-CLIP model retrieves an image that fully satisfies the query.

### Conclusion

Through its multi-view attention mechanism and diversity loss, MVAM improves the performance of two-stream models on image-text matching tasks, particularly on complex queries.
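
As a rough illustration of the multi-view attention mechanism and diversity loss summarized above, the following Python sketch pools a set of token or region features with several attention heads, concatenates the results, and penalizes overlap between heads. It is not the authors' implementation: the dot-product scoring against a view code, the pairwise squared-overlap penalty, and all names (`multi_view_encode`, `diversity_penalty`, `view_codes`) are assumptions made for illustration, and in a real model the view codes would be learned parameters rather than fixed lists.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def multi_view_encode(features, view_codes):
    """Encode one image or text from several views.

    features:   list of token/region vectors, each of dimension d
    view_codes: list of k query vectors of dimension d (hypothetical
                stand-ins for learned view codes, one per attention head)
    Returns (concatenated representation of length k * d,
             list of k attention-weight lists).
    """
    dim = len(features[0])
    representation, attentions = [], []
    for code in view_codes:
        # Assumed scoring rule: dot product between view code and feature.
        weights = softmax([dot(code, f) for f in features])
        # Attention-weighted sum of the features gives this view's vector.
        pooled = [sum(w * f[i] for w, f in zip(weights, features))
                  for i in range(dim)]
        representation.extend(pooled)  # concatenate the views
        attentions.append(weights)
    return representation, attentions


def diversity_penalty(attentions):
    """Assumed diversity term: squared overlap between every pair of heads.

    It is zero when heads attend to disjoint positions and grows as
    heads place weight on the same positions.
    """
    penalty = 0.0
    for i in range(len(attentions)):
        for j in range(i + 1, len(attentions)):
            penalty += dot(attentions[i], attentions[j]) ** 2
    return penalty


def cosine(a, b):
    """Cosine similarity, used as the two-stream matching score."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))
```

Because the image and the text are encoded independently, image representations can be computed ahead of time and compared with a query's representation using `cosine`, which is what keeps two-stream retrieval fast. In training, a term like `diversity_penalty` would be added to the matching loss with some weight; that weight is not specified in this summary.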