Abstract: Text-based person retrieval aims to retrieve a specific pedestrian image from a gallery given a textual description. The primary challenge is overcoming the inherent heterogeneous modality gap under significant intra-class variation and minimal inter-class variation. Existing approaches commonly employ vision-language pre-training or attention mechanisms to learn appropriate cross-modal alignments from noisy inputs. Despite commendable progress, current methods inevitably suffer from two defects: 1) matching ambiguity, which mainly derives from unreliable matching pairs; 2) one-sided cross-modal alignments, which stem from failing to explore one-to-many correspondence, i.e., coarse-grained semantic alignment. These issues significantly degrade retrieval performance. To this end, we propose a novel framework termed Adaptive Uncertainty-based Learning (AUL) that approaches text-based person retrieval from the uncertainty perspective. Specifically, our AUL framework consists of three key components: 1) Uncertainty-aware Matching Filtration, which leverages Subjective Logic to mitigate the disturbance of unreliable matching pairs and select high-confidence cross-modal matches for training; 2) Uncertainty-based Alignment Refinement, which not only simulates coarse-grained alignments by constructing uncertainty representations but also performs progressive learning to properly incorporate coarse- and fine-grained alignments; 3) Cross-modal Masked Modeling, which aims to explore more comprehensive relations between vision and language. Extensive experiments demonstrate that our AUL method consistently achieves state-of-the-art performance on three benchmark datasets in supervised, weakly supervised, and domain generalization settings. Our code is available at https://github.com/CFM-MSG/Code-AUL.
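
The abstract does not spell out how Subjective Logic is used for matching filtration, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It uses the standard Subjective Logic mapping from non-negative evidence to belief masses and an uncertainty mass (Dirichlet strength S = sum of evidence plus K, belief b_k = e_k / S, uncertainty u = K / S), and then keeps only pairs whose uncertainty falls below a threshold. The function names, the two-class (match / mismatch) evidence layout, and the threshold value are all hypothetical.

```python
from typing import List, Tuple


def subjective_opinion(evidence: List[float]) -> Tuple[List[float], float]:
    """Map non-negative evidence over K classes to belief masses and uncertainty.

    Standard Subjective Logic mapping: with Dirichlet parameters e_k + 1,
    the strength is S = sum(e_k) + K, b_k = e_k / S and u = K / S,
    so the beliefs and the uncertainty always sum to 1.
    """
    if not evidence or any(e < 0 for e in evidence):
        raise ValueError("evidence must be a non-empty list of non-negative values")
    num_classes = len(evidence)
    strength = sum(evidence) + num_classes
    beliefs = [e / strength for e in evidence]
    uncertainty = num_classes / strength
    return beliefs, uncertainty


def filter_confident_pairs(
    pair_evidence: List[List[float]], max_uncertainty: float = 0.5
) -> List[int]:
    """Return indices of pairs whose uncertainty mass is at most max_uncertainty.

    Assumption: each entry holds [match_evidence, mismatch_evidence] for one
    image-text pair; the threshold is an illustrative choice.
    """
    kept = []
    for index, evidence in enumerate(pair_evidence):
        _, uncertainty = subjective_opinion(evidence)
        if uncertainty <= max_uncertainty:
            kept.append(index)
    return kept


if __name__ == "__main__":
    # Hypothetical evidence for three image-text pairs.
    pairs = [[9.0, 1.0], [0.2, 0.3], [0.5, 6.0]]
    for evidence in pairs:
        print(evidence, subjective_opinion(evidence))
    print("kept indices:", filter_confident_pairs(pairs))
```

In this sketch a pair with little total evidence receives a large uncertainty mass and is dropped, which mirrors the stated goal of excluding unreliable matching pairs from training. How the evidence is produced and how the filtered pairs enter the loss are left open here, since the abstract does not describe them.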

Deep Adversarial Graph Attention Convolution Network for Text-Based Person Search

Adversarial Attribute-Text Embedding for Person Search with Natural Language Query

Hierarchical Gumbel Attention Network for Text-based Person Search

Text-based Person Search in Full Images via Semantic-Driven Proposal Generation

MGD-GAN: Text-to-Pedestrian Generation Through Multi-grained Discrimination

Domain Adaptive Person Search via GAN-based Scene Synthesis for Cross-scene Videos

Pose-Guided Multi-Granularity Attention Network for Text-Based Person Search

GPAN-PS: Global-Response Pedestrian Attention Network for End-to-End Person Search

Address the Unseen Relationships: Attribute Correlations in Text Attribute Person Search

Learning Semantic-Aligned Feature Representation for Text-based Person Search

VGSG: Vision-Guided Semantic-Group Network for Text-based Person Search

Adaptive Uncertainty-Based Learning for Text-Based Person Retrieval

Hybrid Attention Network for Language-Based Person Search

An Overview of Text-based Person Search: Recent Advances and Future Directions

Hardest and semi-hard negative pairs mining for text-based person search with visual–textual attention

Addressing Information Inequality for Text-Based Person Search via Pedestrian-Centric Visual Denoising and Bias-Aware Alignments

Attentive Multi-Granularity Perception Network for Person Search

Multi-granularity Matching Transformer for Text-based Person Search

Multilevel Collaborative Attention Network for Person Search

PLOT: Text-based Person Search with Part Slot Attention for Corresponding Part Discovery