Abstract: Video-based pedestrian re-identification (Re-ID) aims to re-identify the same person across different camera views. A key problem is learning an effective representation of the pedestrian from video. However, it is difficult to learn such a representation from a single feature modality because of complications in video such as background, occlusion, and blurred scenes. Several studies have therefore fused multimodal features for video-based pedestrian Re-ID. However, most of these works fuse features at the global level, which does not effectively reflect fine-grained and complementary information, so the improvement in performance is limited. To obtain a more effective representation, we propose to learn fine-grained features from different modalities of the video and then align and fuse them at the fine-grained level to capture rich semantic information. To this end, we propose a multimodal token-learning and alignment model (MTLA) to re-identify pedestrians across camera videos. MTLA consists of three modules: a multimodal feature encoder, token-based cross-modal alignment, and correlation-aware fusion. First, the multimodal feature encoder extracts multimodal features from the visual appearance and gait information views, and fine-grained tokens are then learned and denoised from these features. Next, the token-based cross-modal alignment module aligns the multimodal features at the token level to capture fine-grained semantic information. Finally, the correlation-aware fusion module fuses the multimodal token features by learning the inter- and intra-modal correlations, in which the features refine each other and a unified representation is obtained for pedestrian Re-ID. To evaluate the performance of fine-grained feature alignment and fusion, we conduct extensive experiments on three benchmark datasets. Compared with state-of-the-art approaches, all evaluation metrics (mAP and Rank-K) improve by more than 0.4 percentage points.
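
To make the idea of token-level alignment and fusion concrete, the following is a minimal sketch only, not the MTLA implementation. The abstract does not specify how alignment or fusion is computed, so the sketch assumes scaled dot-product attention for alignment and simple averaging for fusion. The function names `align_tokens` and `fuse` and the toy token values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def align_tokens(query_tokens, context_tokens):
    """For each query token, return an attention-weighted mix of context tokens.

    Assumption: scaled dot-product attention stands in for token-based
    cross-modal alignment, whose exact form the abstract does not give.
    """
    dim = len(query_tokens[0])
    aligned = []
    for q in query_tokens:
        weights = softmax([dot(q, c) / math.sqrt(dim) for c in context_tokens])
        aligned.append([
            sum(w * c[i] for w, c in zip(weights, context_tokens))
            for i in range(dim)
        ])
    return aligned


def fuse(appearance_tokens, gait_tokens):
    """Fuse two token sets into one vector.

    Assumption: each appearance token is averaged with its gait-aligned
    counterpart, then all tokens are mean-pooled. This is a placeholder
    for correlation-aware fusion, not the paper's formulation.
    """
    aligned_gait = align_tokens(appearance_tokens, gait_tokens)
    refined = [
        [(a + g) / 2.0 for a, g in zip(app, gait)]
        for app, gait in zip(appearance_tokens, aligned_gait)
    ]
    dim = len(refined[0])
    return [sum(tok[i] for tok in refined) / len(refined) for i in range(dim)]


# Toy inputs (hypothetical): 3 appearance tokens and 2 gait tokens, 4-D each.
appearance = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
gait = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
representation = fuse(appearance, gait)
```

In this toy setup each appearance token attends over all gait tokens, so the fused vector mixes information at the token level rather than combining one global feature per modality. That is the distinction the abstract draws from global-level fusion. A learned model would replace the fixed averaging with trainable projections and correlation weights.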

GLSFF: Global–local Specific Feature Fusion for Cross-Modality Pedestrian Re-Identification

A Novel Two-Stream Saliency Image Fusion CNN Architecture for Person Re-Identification

Person Re-identification Network Based on Multi-Level Feature Fusion

Gaussian-based Probability Fusion for Person Re-Identification with Taylor Angular Margin Loss

Joining Features by Global Guidance with Bi-Relevance Trihard Loss for Person Re-Identification

FCL: Pedestrian Re-Identification Algorithm Based on Feature Fusion Contrastive Learning

Dual-stream feature fusion network for person re-identification

Pedestrian Re-Identification Based on Gait Analysis

Pedestrian Re-Identification Based on Fine-Grained Feature Learning and Fusion

Weight Determination In Multi-Feature Fusion For Pedestrian Re-Identification

Multiple-local feature and attention fused person re-identification method

GLAD: Global-Local-Alignment Descriptor For Pedestrian Retrieval

Person Re-Identification Based on Spatial Feature Learning and Multi-Granularity Feature Fusion

Cross modality person re-identification via mask-guided dynamic dual-task collaborative learning

Dual adaptive alignment and partitioning network for visible and infrared cross-modality person re-identification

Cross-Modality Person Re-Identification Based on Heterogeneous Center Loss and Non-Local Features

Cross-modal Local Shortest Path and Global Enhancement for Visible-Thermal Person Re-Identification

Salience-Guided Cascaded Suppression Network for Person Re-identification

DIMGNet: A Transformer-based Network for Pedestrian Reidentification with Multi-granularity Information Mutual Gain

Pedestrian Re-ID based on feature consistency and contrast enhancement

Person Reidentification via Multi-Feature Fusion With Adaptive Graph Learning