Abstract: Video-based pedestrian re-identification (Re-ID) aims to re-identify the same person across different camera views. A key problem is learning an effective representation of the pedestrian from video. This is difficult to do from a single feature modality because of complications in video such as background clutter, occlusion, and blurred scenes. Several studies have therefore fused multimodal features for video-based pedestrian Re-ID. However, most of these works fuse features at the global level, which does not capture fine-grained, complementary information well, so the performance gain is limited. To obtain a more effective representation, we propose to learn fine-grained features from different modalities of the video and to align and fuse them at the fine-grained level to capture rich semantic information. To this end, we propose a multimodal token-learning and alignment model (MTLA) for re-identifying pedestrians across camera videos. MTLA consists of three modules: a multimodal feature encoder, token-based cross-modal alignment, and correlation-aware fusion. First, the multimodal feature encoder extracts features from the visual appearance and gait information views, and fine-grained tokens are then learned and denoised from these features. Next, the token-based cross-modal alignment module aligns the multimodal features at the token level to capture fine-grained semantic information. Finally, the correlation-aware fusion module fuses the multimodal token features by learning inter- and intra-modal correlations, so that the features refine each other and a unified representation is obtained for pedestrian Re-ID. To evaluate the fine-grained feature alignment and fusion, we conduct extensive experiments on three benchmark datasets. Compared with state-of-the-art approaches, all evaluation metrics (mAP and Rank-K) improve by more than 0.4 percentage points.
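The token-level alignment and fusion steps described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes that plain scaled dot-product cross-attention stands in for the token-based cross-modal alignment, and that residual addition followed by mean pooling stands in for the correlation-aware fusion. The function names (`cross_attend`, `fuse`), the toy token values, and the requirement that both modalities share one feature dimension are all hypothetical choices made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cross_attend(queries, keys):
    """For each query token, return an attention-weighted average of the key tokens.

    Assumption: scaled dot-product attention without learned projections.
    """
    scale = math.sqrt(len(queries[0]))
    dim = len(keys[0])
    aligned = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys])
        aligned.append(
            [sum(w * k[d] for w, k in zip(weights, keys)) for d in range(dim)]
        )
    return aligned


def fuse(appearance_tokens, gait_tokens):
    """Align each modality to the other at token level, then pool to one vector.

    Assumption: residual addition plus mean pooling replaces the learned
    correlation-aware fusion described in the abstract.
    """
    gait_for_app = cross_attend(appearance_tokens, gait_tokens)
    app_for_gait = cross_attend(gait_tokens, appearance_tokens)
    refined = [
        [x + y for x, y in zip(tok, ctx)]
        for tok, ctx in zip(appearance_tokens, gait_for_app)
    ]
    refined += [
        [x + y for x, y in zip(tok, ctx)]
        for tok, ctx in zip(gait_tokens, app_for_gait)
    ]
    count = len(refined)
    return [sum(tok[d] for tok in refined) / count for d in range(len(refined[0]))]


# Toy tokens (hypothetical values): two appearance tokens, three gait tokens.
appearance = [[1.0, 0.0], [0.0, 1.0]]
gait = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = fuse(appearance, gait)
print(len(fused))
```

Each token is refined by context from the other modality before pooling, so the pooled vector reflects token-level correspondences rather than a single global concatenation. A trained model would add learned projections, denoising, and intra-modal attention, none of which are shown here.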

Addressing Information Inequality for Text-Based Person Search via Pedestrian-Centric Visual Denoising and Bias-Aware Alignments

Towards Accurate Dense Pedestrian Detection via Occlusion-Prediction Aware Label Assignment and Hierarchical-NMS

Adversarial Attribute-Text Embedding for Person Search with Natural Language Query

Hierarchical Gumbel Attention Network for Text-based Person Search

Image-Specific Information Suppression and Implicit Local Alignment for Text-based Person Search

Text-based Person Search in Full Images via Semantic-Driven Proposal Generation

Locality guided cross-modal feature aggregation and pixel-level fusion for multispectral pedestrian detection

Deep Adversarial Graph Attention Convolution Network for Text-Based Person Search

An Overview of Text-based Person Search: Recent Advances and Future Directions

Adaptive Uncertainty-Based Learning for Text-Based Person Retrieval

Learning Semantic-Aligned Feature Representation for Text-based Person Search

Beyond the Parts: Learning Coarse-to-Fine Adaptive Alignment Representation for Person Search

Text-based person search via cross-modal alignment learning

Pedestrian Re-Identification Based on Fine-Grained Feature Learning and Fusion

Cross-Modality Proposal-guided Feature Mining for Unregistered RGB-Thermal Pedestrian Detection

Attribute-Aware Implicit Modality Alignment for Text Attribute Person Search

The Cross-Modality Disparity Problem in Multispectral Pedestrian Detection

Improving Multispectral Pedestrian Detection by Addressing Modality Imbalance Problems

VGSG: Vision-Guided Semantic-Group Network for Text-based Person Search

Domain Adaptive Person Search via GAN-based Scene Synthesis for Cross-scene Videos