Region-based Quality Estimation Network for Large-scale Person Re-identification

Guanglu Song, Biao Leng, Yu Liu, Congrui Hetang, Shaofan Cai
DOI: https://doi.org/10.48550/arXiv.1711.08766
2017-12-21
Abstract: One of the major restrictions on the performance of video-based person re-id is partial noise caused by occlusion, blur and illumination. Since different spatial regions of a single frame have various quality, and the quality of the same region also varies across frames in a tracklet, a good way to address the problem is to effectively aggregate complementary information from all frames in a sequence, using better regions from other frames to compensate the influence of an image region with poor quality. To achieve this, we propose a novel Region-based Quality Estimation Network (RQEN), in which an ingenious training mechanism enables the effective learning to extract the complementary region-based information between different frames. Compared with other feature extraction methods, we achieved comparable results of 92.4%, 76.1% and 77.83% on the PRID 2011, iLIDS-VID and MARS, respectively. In addition, to alleviate the lack of clean large-scale person re-id datasets for the community, this paper also contributes a new high-quality dataset, named "Labeled Pedestrian in the Wild (LPW)" which contains 7,694 tracklets with over 590,000 images. Despite its relatively large scale, the annotations also possess high cleanliness. Moreover, it's more challenging in the following aspects: the age of characters varies from childhood to elderhood; the postures of people are diverse, including running and cycling in addition to the normal walking state.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses noise in video-based person re-identification (re-id), caused mainly by occlusion, blur, and illumination changes. Different spatial regions of a single frame differ in quality, and the quality of the same region varies across the frames of a tracklet. The central question is therefore how to aggregate the complementary information from all frames in a sequence, so that high-quality regions from other frames compensate for a low-quality region in a given frame. To this end, the authors propose a Region-based Quality Estimation Network (RQEN), whose training mechanism lets it learn to extract complementary region-based information across frames. The paper also introduces a new large-scale, high-quality dataset, "Labeled Pedestrian in the Wild (LPW)", to address the small scale or noisy annotations of existing person re-id datasets.

### Main contributions

1. **Propose RQEN**: For the first time, the quality of different image regions is taken into account, so that complementary region information in a sequence is aggregated more effectively and the high-quality information of a region in one frame makes up for the low quality of the same region in other frames.
2. **Jointly train multi-level features**: A joint training pipeline enables the region quality predictor to produce appropriate region quality estimates and achieves state-of-the-art performance on video-level person re-identification tasks.
3. **Construct the LPW dataset**: A large-scale, high-quality person re-identification dataset containing 7,694 tracklets and more than 590,000 images, providing a benchmark closer to real-world scenes.

### Method overview

The main architecture of RQEN includes:

- **Fully convolutional network**: Generates an intermediate representation of the input image.
- **Region feature generation unit**: Locates human key points with a key point detector, divides the intermediate representation into regions accordingly, and generates a feature vector for each region.
- **Region quality predictor**: Generates a quality score between 0 and 1 for each region.
- **Set aggregation unit**: Weights the region features of all frames by their quality scores and aggregates them into the final video-level feature representation.

### Experimental results

- **Performance improvement**: On the PRID 2011 dataset, the top-1 accuracy of RQEN is 1.5% higher than that of existing methods; on the iLIDS-VID dataset, it is 9.1% higher.
- **Large-scale dataset verification**: RQEN also performs well on the MARS and LPW datasets. On LPW in particular, its top-1 accuracy is 15.6% higher than that of the baseline model.

### Conclusion

By considering the quality of different image regions, RQEN effectively aggregates the complementary information in a sequence and significantly improves the performance of video-based person re-identification. The LPW dataset also provides a valuable resource for further research in this field.
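
As an illustration of the set aggregation unit described in the method overview, the following is a minimal sketch of quality-weighted aggregation over a tracklet. It is not the authors' implementation. The function names `aggregate_regions` and `video_descriptor` are hypothetical, plain Python lists stand in for network tensors, and normalizing each region's scores across frames so that the weights sum to 1 is an assumption made for this example.

```python
from typing import List

Vector = List[float]


def aggregate_regions(
    features: List[List[Vector]],
    scores: List[List[float]],
    eps: float = 1e-12,
) -> List[Vector]:
    """Pool per-frame region features into one vector per region.

    features[t][r] is the feature vector of region r in frame t.
    scores[t][r] is the quality score of that region, in [0, 1].
    Assumption for this sketch: scores are normalized across frames,
    separately for each region, before the weighted sum.
    """
    num_frames = len(features)
    num_regions = len(features[0])
    aggregated: List[Vector] = []
    for r in range(num_regions):
        # Sum of quality scores of region r over all frames (eps avoids 0/0).
        total = sum(scores[t][r] for t in range(num_frames)) + eps
        dim = len(features[0][r])
        pooled = [0.0] * dim
        for t in range(num_frames):
            weight = scores[t][r] / total
            for d in range(dim):
                pooled[d] += weight * features[t][r][d]
        aggregated.append(pooled)
    return aggregated


def video_descriptor(
    features: List[List[Vector]], scores: List[List[float]]
) -> Vector:
    """Concatenate the pooled region vectors into one video-level vector."""
    return [x for region in aggregate_regions(features, scores) for x in region]


# Toy tracklet: 2 frames, 2 regions, 2-dimensional features.
features = [
    [[1.0, 0.0], [0.0, 2.0]],  # frame 0
    [[3.0, 0.0], [0.0, 4.0]],  # frame 1
]
scores = [
    [0.9, 0.1],  # frame 0: region 0 is clean, region 1 is degraded
    [0.1, 0.9],  # frame 1: region 0 is degraded, region 1 is clean
]
print(video_descriptor(features, scores))  # approximately [1.2, 0.0, 0.0, 3.8]
```

In the toy tracklet, each region's pooled feature is dominated by the frame in which that region scored highest, which is the compensation effect the summary describes: a degraded region in one frame contributes little when the same region is clean elsewhere.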