Jialong Zuo, Hanyu Zhou, Ying Nie, Feng Zhang, Tianyu Guo, Nong Sang, Yunhe Wang, Changxin Gao
Abstract: Existing text-based person retrieval datasets often have relatively coarse-grained text annotations. This hinders models from comprehending the fine-grained semantics of query texts in real scenarios. To address this problem, we contribute a new benchmark named **UFineBench** for text-based person retrieval with ultra-fine granularity.
Firstly, we construct a new **dataset** named UFine6926. We collect a large number of person images and manually annotate each image with two detailed textual descriptions, averaging 80.8 words each. The average word count is three to four times that of previous datasets. In addition to standard in-domain evaluation, we also propose a special **evaluation paradigm** that is more representative of real scenarios. It contains a new evaluation set with cross domains, cross textual granularity and cross textual styles, named UFine3C, and a new evaluation metric for accurately measuring retrieval ability, named mean Similarity Distribution (mSD). Moreover, we propose CFAM, a more efficient **algorithm** especially designed for text-based person retrieval with ultra fine-grained texts. It achieves fine granularity mining by adopting a shared cross-modal granularity decoder and a hard negative match mechanism.
With standard in-domain evaluation, CFAM establishes competitive performance across various datasets, especially on our ultra fine-grained UFine6926. Furthermore, by evaluating on UFine3C, we demonstrate that training on our UFine6926 significantly improves generalization to real scenarios compared with other coarse-grained datasets. The dataset and code will be made publicly available at https://github.com/Zplusdragon/UFineBench.
What problem does this paper attempt to address?
The paper addresses a limitation of text-based person retrieval: the text annotations in existing datasets are too coarse, which makes it difficult for a model to understand the fine-grained semantics of query texts in practical scenarios. Specifically, the paper points out the following main problems:
1. **The text annotation granularity of existing datasets is too coarse**: Existing text-based person retrieval datasets (such as CUHK-PEDES, ICFG-PEDES, and RSTPReid) claim to be fine-grained, but the text descriptions they provide are often rough, containing only common appearance features and lacking specific descriptions of unique ones. As a result, a model learns to recognize only typical attributes and cannot understand the fine-grained semantics of complex query texts.
2. **The ambiguity of text-image matching**: In existing datasets, one text description may correspond to multiple different identities. This introduces significant ambiguity during training and hinders the model from accurately learning the matching relationship between text and image.
3. **The limitations of the standard evaluation set**: Existing standard evaluation sets (such as CUHK-PEDES, ICFG-PEDES, and RSTPReid) usually have a fixed domain, a fixed text granularity, and a fixed text style. Practical scenarios, by contrast, usually involve wide spatio-temporal coverage, inconsistent query text granularity, and describers with their own styles of expression, so these sets cannot effectively evaluate how a model performs in practice.
4. **The inaccuracy of existing evaluation metrics**: Existing evaluation metrics (such as Rank-k and mAP) are based on discrete ranking conditions and cannot sensitively measure differences between continuous similarity values, resulting in inaccurate evaluations of the model's retrieval ability.
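The fourth point can be illustrated with a small sketch. It does not implement mSD, whose formula is not given in this summary; it only shows, on made-up similarity scores, that a rank-based check reports the same result for two models whose continuous similarity values differ greatly. The function names and the "margin" quantity are assumptions introduced for illustration.

```python
def rank_k_hit(similarities, positive_index, k=1):
    """Return True if the positive gallery item is among the top-k by similarity."""
    order = sorted(range(len(similarities)), key=lambda i: similarities[i], reverse=True)
    return positive_index in order[:k]


def positive_margin(similarities, positive_index):
    """Gap between the positive's similarity and the best negative's similarity.

    This is an illustrative quantity only, not the paper's mSD metric.
    """
    best_negative = max(s for i, s in enumerate(similarities) if i != positive_index)
    return similarities[positive_index] - best_negative


# Hypothetical similarity scores of one text query against a 4-image gallery.
# Index 0 is the matching image in both cases.
confident_model = [0.90, 0.20, 0.15, 0.10]
borderline_model = [0.51, 0.50, 0.49, 0.48]

rank1_confident = rank_k_hit(confident_model, positive_index=0, k=1)
rank1_borderline = rank_k_hit(borderline_model, positive_index=0, k=1)
margin_confident = positive_margin(confident_model, positive_index=0)
margin_borderline = positive_margin(borderline_model, positive_index=0)

# Both queries count as rank-1 hits, yet the margins differ by a wide gap.
print(rank1_confident, rank1_borderline)
print(round(margin_confident, 2), round(margin_borderline, 2))
```

Both queries score identically under Rank-1, although the second model separates the correct image from its nearest competitor by a far smaller gap. A metric built on the similarity values themselves can register that difference.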
To solve these problems, the paper makes the following contributions:
1. **Constructed a high-quality fine-grained dataset**: UFine6926 contains 6,926 identities, 26,206 images, and 52,412 text descriptions. Each image is manually annotated with two detailed text descriptions averaging 80.8 words each, three to four times the length of the descriptions in existing datasets.
2. **Constructed a special evaluation set across domains, text granularities, and text styles**: UFine3C contains 7,446 images and 37,939 text queries, involving 2,250 people. This evaluation set is closer to practical scenarios and can better evaluate the generalization ability of the model.
3. **Proposed a new evaluation metric**: mean Similarity Distribution (mSD) is based on continuous similarity values rather than discrete ranking conditions, so it can more sensitively measure performance differences between models.
4. **Designed a new cross-modal fine-grained alignment and matching framework**: CFAM achieves fine-grained mining through a shared cross-modal granularity decoder and a hard negative sample matching mechanism, and improves performance on various datasets, especially on the fine-grained UFine6926 dataset.
Through these contributions, the paper aims to advance text-based person retrieval so that it better meets the needs of practical applications.
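The idea of hard negative sample matching mentioned for CFAM can be sketched generically. The following is a minimal illustration of one common way to select hard negatives in image-text matching, not the paper's actual implementation: for each text, it picks the most similar image that belongs to a different identity. The function name, the identity labels, and the similarity values are all hypothetical.

```python
def pick_hard_negatives(sim, text_ids, image_ids):
    """For each text, return the index of the most similar image with a different identity.

    sim[i][j] is the similarity of text i to image j.
    Returns None for a text when every image shares its identity.
    """
    hard = []
    for i, row in enumerate(sim):
        # Images of the same identity are positives, so they are excluded here.
        candidates = [j for j in range(len(row)) if image_ids[j] != text_ids[i]]
        hard.append(max(candidates, key=lambda j: row[j]) if candidates else None)
    return hard


# Hypothetical batch: 3 texts and 4 images; identity labels are made up.
text_ids = [7, 7, 9]
image_ids = [7, 9, 9, 4]
sim = [
    [0.80, 0.60, 0.30, 0.10],
    [0.70, 0.20, 0.65, 0.40],
    [0.50, 0.90, 0.85, 0.55],
]

hard_negatives = pick_hard_negatives(sim, text_ids, image_ids)
print(hard_negatives)
```

Excluding same-identity images from the candidates matters because several images can depict the same person. Treating them as negatives would reintroduce the text-image ambiguity described in the problem list.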