Learning shared features from specific and ambiguous descriptions for text-based person search

Ke Cheng, Qikai Geng, Shucheng Huang, Juanjuan Tu, Hu Lu
DOI: https://doi.org/10.1007/s00530-024-01286-z
IF: 3.9
2024-03-28
Multimedia Systems
Abstract: Text-based person search aims to retrieve pedestrian images using natural language descriptions. Previous studies have primarily focused on leveraging information among pedestrians with distinct identities, overlooking data variations within the same identity. Although some have attempted to extract multiple samples for each identity, they did not employ an appropriate loss function. To address this gap, we present LFSA, a concise cross-modal framework that Learns shared Features from Specific and Ambiguous descriptions. First, building upon a distinctive sampling strategy, we formulate the Boundary Constraints Loss (BCL) and the Hard Sample Mining Loss (HSML), which aim to extract unique features from specific descriptions while capturing shared features from ambiguous descriptions. Then, we introduce a textual augmentation module, Mask-Delete-Replace (MDR), which employs three operations to direct the model's attention toward more comprehensive details within the textual descriptions. LFSA uses CLIP as the backbone of the network, leveraging only its global features from the [CLS] token. Extensive experiments on two benchmark datasets, CUHK-PEDES and ICFG-PEDES, demonstrate the effectiveness of our approach. Code is available at https://github.com/CottonCandyZ/LFSA.
computer science, information systems, theory & methods
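The abstract names the three MDR operations but does not say how tokens are selected or how often each operation is applied. As a rough illustration only, the sketch below shows one way a mask/delete/replace augmentation over a tokenized description could look. The function name `mask_delete_replace`, the `[MASK]` placeholder, the per-token probability `p`, the uniform choice among the three operations, and the toy vocabulary are all assumptions made for this example, not details taken from the paper or its repository.

```python
import random

MASK_TOKEN = "[MASK]"  # assumed placeholder; the paper's actual mask token may differ


def mask_delete_replace(tokens, vocab, p=0.15, rng=None):
    """Hypothetical MDR-style augmentation.

    Each token is selected with probability p; a selected token is then
    masked, deleted, or replaced by a random vocabulary word, with the
    operation chosen uniformly (an assumption, not the paper's setting).
    """
    if rng is None:
        rng = random.Random()
    out = []
    for tok in tokens:
        if rng.random() >= p:
            out.append(tok)  # token left unchanged
            continue
        op = rng.choice(("mask", "delete", "replace"))
        if op == "mask":
            out.append(MASK_TOKEN)
        elif op == "replace":
            out.append(rng.choice(vocab))
        # op == "delete": the token is simply dropped
    # Fall back to the original tokens if everything was deleted.
    return out if out else list(tokens)


# Toy usage with a made-up description and vocabulary.
caption = "a woman in a red coat carrying a black backpack".split()
toy_vocab = ["blue", "shirt", "man", "bag", "jeans"]
augmented = mask_delete_replace(caption, toy_vocab, p=0.3, rng=random.Random(0))
print(" ".join(augmented))
```

Passing a seeded `random.Random` keeps the augmentation reproducible, which is convenient when comparing training runs.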