MARS: Paying more attention to visual attributes for text-based person search

Alex Ergasti, Tomaso Fontanini, Claudio Ferrari, Massimo Bertozzi, Andrea Prati

2024-07-05

Abstract: Text-based person search (TBPS) is a problem that has gained significant interest within the research community. The task is to retrieve one or more images of a specific individual based on a textual description. The multi-modal nature of the task requires learning representations that bridge text and image data within a shared latent space. Existing TBPS systems face two major challenges. The first is inter-identity noise, which stems from the inherent vagueness and imprecision of text descriptions: the same description of visual attributes can generally be associated with different people. The second is intra-identity variation, which covers all those nuisances (e.g., pose, illumination) that can alter the visual appearance of the same textual attributes for a given subject. To address these issues, this paper presents a novel TBPS architecture named MARS (Mae-Attribute-Relation-Sensitive), which enhances current state-of-the-art models by introducing two key components: a Visual Reconstruction Loss and an Attribute Loss. The former employs a Masked AutoEncoder trained to reconstruct randomly masked image patches with the aid of the textual description. In doing so, the model is encouraged to learn more expressive representations and textual-visual relations in the latent space. The Attribute Loss, instead, balances the contribution of different types of attributes, defined as adjective-noun chunks of text. This loss ensures that every attribute is taken into consideration in the person retrieval process. Extensive experiments on three commonly used datasets, namely CUHK-PEDES, ICFG-PEDES, and RSTPReid, report performance improvements, with significant gains in the mean Average Precision (mAP) metric w.r.t. the current state of the art.
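
Since the abstract reports its gains in mean Average Precision, a short illustration of how that metric is conventionally computed for retrieval may help. The sketch below is not the authors' evaluation code: the function names, the toy identity labels, and the accompanying Rank-k helper are assumptions made for illustration. Each text query yields a ranked list of gallery identity labels, AP averages the precision at every rank where the correct identity appears, and mAP averages AP over all queries.

```python
from typing import List, Sequence


def average_precision(ranked_ids: Sequence[int], query_id: int) -> float:
    """AP for one query: mean of precision@k over the ranks k that hold a correct image."""
    hits = 0
    precisions: List[float] = []
    for rank, gallery_id in enumerate(ranked_ids, start=1):
        if gallery_id == query_id:
            hits += 1
            precisions.append(hits / rank)
    # A query with no correct image in the ranking contributes an AP of 0.
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(rankings: Sequence[Sequence[int]],
                           query_ids: Sequence[int]) -> float:
    """mAP: the average of the per-query AP values."""
    aps = [average_precision(r, q) for r, q in zip(rankings, query_ids)]
    return sum(aps) / len(aps) if aps else 0.0


def rank_k_accuracy(rankings: Sequence[Sequence[int]],
                    query_ids: Sequence[int], k: int = 1) -> float:
    """Fraction of queries with at least one correct image in the top k results."""
    found = [any(g == q for g in r[:k]) for r, q in zip(rankings, query_ids)]
    return sum(found) / len(found) if found else 0.0


if __name__ == "__main__":
    # Toy data (hypothetical): two queries, each with a ranked list of gallery identities.
    rankings = [[7, 3, 7, 5], [5, 3, 9]]
    query_ids = [7, 3]
    print("mAP:", mean_average_precision(rankings, query_ids))
    print("Rank-1:", rank_k_accuracy(rankings, query_ids, k=1))
```

Unlike Rank-k accuracy, which only checks whether one correct image appears near the top, mAP rewards placing all images of the described person high in the ranking.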

Computer Vision and Pattern Recognition, Artificial Intelligence

What problem does this paper attempt to address?

The paper aims to address two main challenges in the task of text-based person search (TBPS): inter-identity noise and intra-identity variations. Specifically:

1. **Inter-identity noise**: Due to the ambiguity and imprecision of text descriptions, different individuals may have similar descriptions, making it difficult to distinguish them during retrieval.
2. **Intra-identity variations**: Different images of the same person may vary in appearance due to factors such as pose and lighting conditions, and the text descriptions may vary in granularity and ambiguity, leading to inconsistencies in the descriptions of the same person.

To tackle these challenges, the paper proposes a novel TBPS architecture called MARS (Mae-Attribute-Relation-Sensitive), with the following core contributions:

- **Visual Reconstruction Loss**: Utilizing a Masked AutoEncoder (MAE) to reconstruct randomly masked image patches, aided by text descriptions, to enhance the model's understanding of text-image relationships.
- **Attribute Loss**: Balancing the importance of different types of attributes (adjective-noun combinations) to ensure that each attribute is fully considered during retrieval, thereby improving the model's discriminative ability.

With these two new components, MARS is able to significantly improve the mean Average Precision (mAP) metric on multiple commonly used datasets.
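
The masking idea behind the Visual Reconstruction Loss can be illustrated with a minimal sketch. This is not the MARS implementation: the function names, the 0.75 mask ratio, and the choice to score only the masked patches (the convention of the original MAE formulation) are assumptions, and the text conditioning that the paper describes is left out entirely. Patches are represented as plain lists of floats so the example has no dependencies.

```python
import random
from typing import List, Sequence


def random_patch_mask(num_patches: int, mask_ratio: float,
                      rng: random.Random) -> List[bool]:
    """Return one flag per patch; True marks a patch hidden from the encoder."""
    num_masked = int(num_patches * mask_ratio)
    masked_idx = set(rng.sample(range(num_patches), num_masked))
    return [i in masked_idx for i in range(num_patches)]


def masked_reconstruction_loss(target: Sequence[Sequence[float]],
                               predicted: Sequence[Sequence[float]],
                               mask: Sequence[bool]) -> float:
    """Mean squared error over the values of masked patches only."""
    total, count = 0.0, 0
    for t_patch, p_patch, is_masked in zip(target, predicted, mask):
        if not is_masked:
            continue  # visible patches do not contribute to the loss
        for t, p in zip(t_patch, p_patch):
            total += (t - p) ** 2
            count += 1
    return total / count if count else 0.0


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical setup: 16 patches, each flattened to 4 values.
    target = [[rng.random() for _ in range(4)] for _ in range(16)]
    # Stand-in for a decoder output: a constant guess for every value.
    predicted = [[0.5] * 4 for _ in range(16)]
    mask = random_patch_mask(num_patches=16, mask_ratio=0.75, rng=rng)
    print("masked patches:", sum(mask))
    print("loss:", masked_reconstruction_loss(target, predicted, mask))
```

In a full model, the predictions would come from a decoder that sees the visible patch embeddings together with the text features, which is what ties the reconstruction objective to the description.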
