Abstract: Text-based person search aims to retrieve the pedestrian images that best match a given textual description from a gallery of images. Previous methods use soft attention to infer semantic alignments between image regions and the corresponding words in the sentence. However, these methods may fuse irrelevant multi-modality features together, which causes a matching redundancy problem. In this work, we propose a novel hierarchical Gumbel attention network for text-based person search based on a Gumbel top-k re-parameterization algorithm. Specifically, it adaptively selects the strongly semantically relevant image regions and words/phrases from images and texts for precise alignment and similarity calculation. This hard selection strategy fuses only the strongly relevant multi-modality features, alleviating the matching redundancy problem. Meanwhile, the Gumbel top-k re-parameterization algorithm is designed as a low-variance, unbiased gradient estimator that handles the discreteness of the hard attention mechanism in an end-to-end manner. Moreover, the model employs a hierarchical adaptive matching strategy at three granularities, i.e., word level, phrase level, and sentence level, for fine-grained matching. Extensive experimental results demonstrate state-of-the-art performance. Compared with the best existing method, we achieve relative improvements of 8.24% in Rank-1 and 7.6% in mAP on the text-to-image retrieval task, and 5.58% in Rank-1 and 6.3% in mAP on the image-to-text retrieval task, on the CUHK-PEDES dataset.
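
To make the hard selection step concrete, the following is a minimal sketch of Gumbel top-k sampling only: each relevance score is perturbed with independent Gumbel(0, 1) noise and the k largest perturbed scores are kept, which draws k distinct items without replacement. It assumes the scores for one word against a set of image regions are already available as a plain list. The names `gumbel_top_k`, `hard_mask`, and `masked_mean` are hypothetical, and the sketch does not reproduce the paper's gradient estimator, its network, or its hierarchical matching.

```python
import math
import random


def gumbel_top_k(scores, k, rng=None):
    """Sample k distinct indices by adding Gumbel(0, 1) noise to each score.

    Hypothetical helper for illustration; `scores` are unnormalized
    relevance scores (e.g. word-to-region similarities).
    """
    if not 0 < k <= len(scores):
        raise ValueError("k must satisfy 0 < k <= len(scores)")
    rng = rng or random.Random()
    perturbed = []
    for score in scores:
        # Clamp the uniform draw away from 0 so that log(u) stays finite.
        u = max(rng.random(), 1e-12)
        gumbel = -math.log(-math.log(u))
        perturbed.append(score + gumbel)
    # Indices sorted by perturbed score, largest first; keep the first k.
    order = sorted(range(len(scores)), key=lambda i: perturbed[i], reverse=True)
    return order[:k]


def hard_mask(n, selected):
    """Turn selected indices into a 0/1 mask of length n."""
    chosen = set(selected)
    return [1 if i in chosen else 0 for i in range(n)]


def masked_mean(values, mask):
    """Average only the values whose mask entry is 1."""
    kept = [v for v, m in zip(values, mask) if m]
    return sum(kept) / len(kept)


if __name__ == "__main__":
    # Assumed toy similarities between one word and five image regions.
    similarities = [0.1, 2.0, -1.0, 0.5, 1.5]
    picked = gumbel_top_k(similarities, k=2, rng=random.Random(0))
    mask = hard_mask(len(similarities), picked)
    print(picked, mask, masked_mean(similarities, mask))
```

In this sketch the mask discards the unselected regions entirely, which is the contrast with soft attention, where every region would contribute a weighted share. The sorting step is not differentiable, so training end to end would need an additional estimator such as the one the abstract describes.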

Gumbel-Attention for Multi-modal Machine Translation

Multimodal Transformer For Multimodal Machine Translation

HybridVocab: Towards Multi-Modal Machine Translation Via Multi-Aspect Alignment

Multimodal Image-to-Image Translation via Mutual Information Estimation and Maximization

Increasing Visual Awareness in Multimodal Neural Machine Translation from an Information Theoretic Perspective

Vision Matters When It Should: Sanity Checking Multimodal Machine Translation Models

Contrastive Learning Based Visual Representation Enhancement for Multimodal Machine Translation

Supervised Visual Attention for Simultaneous Multimodal Machine Translation

Latent Variable Model for Multi-modal Translation

Enhancing Neural Machine Translation with Dual-Side Multimodal Awareness

Tackling Ambiguity with Images: Improved Multimodal Machine Translation and Contrastive Evaluation

A Visual Attention Grounding Neural Model for Multimodal Machine Translation

Good for Misconceived Reasons: An Empirical Revisiting on the Need for Visual Context in Multimodal Machine Translation

Hierarchical Gumbel Attention Network for Text-based Person Search

M3D-GAN: Multi-Modal Multi-Domain Translation with Universal Attention

Multi-grained visual pivot-guided multi-modal neural machine translation with text-aware cross-modal contrastive disentangling

Attention-Based Modality-Gated Networks for Image-Text Sentiment Analysis

Multimodal Neural Machine Translation with Search Engine Based Image Retrieval

Visual Agreement Regularized Training for Multi-Modal Machine Translation

3AM: An Ambiguity-Aware Multi-Modal Machine Translation Dataset

Adding Multimodal Capabilities to a Text-only Translation Model