A Video Is Worth Three Views: Trigeminal Transformers for Video-Based Person Re-Identification
Xuehu Liu, Pingping Zhang, Chenyang Yu, Xuesheng Qian, Xiaoyun Yang, Huchuan Lu
DOI: https://doi.org/10.1109/tits.2024.3386914
IF: 8.5
2024-09-02
IEEE Transactions on Intelligent Transportation Systems
Abstract: Video-based person Re-Identification (Re-ID) is an active research topic in intelligent transportation systems. It aims to retrieve video sequences of the same person across non-overlapping surveillance cameras. Compared with static images, video sequences contain more visual information from multiple views, such as the spatial and temporal views. However, previous Re-ID methods usually focus on a single limited view and therefore lack diverse observations from different views. To capture richer perceptions and extract more comprehensive representations, we propose a novel learning framework named Trigeminal Transformers (TMT) for video-based person Re-ID. More specifically, we first design a View-wise Projector (VP) to jointly transform raw videos into spatial, temporal and spatial-temporal views. In addition, inspired by the success of Vision Transformers (ViT), we introduce the Transformer structure for information enhancement and aggregation. Three Self-view Transformers (ST) exploit the relationships among local features to enhance information within the spatial, temporal and spatial-temporal views, respectively. Moreover, a Cross-view Transformer (CT) aggregates the multi-view features into comprehensive representations. Experimental results on four public Re-ID benchmarks indicate that our approach achieves better performance than several state-of-the-art approaches.
engineering, electrical & electronic,transportation science & technology, civil
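
The three views named in the abstract can be illustrated with a minimal sketch. It is not the authors' View-wise Projector: the function name `view_tokens`, the use of scalar features, and the choice of average pooling are all assumptions made for illustration. The sketch only shows how one T x H x W feature volume could yield a spatial token set (pooled over time), a temporal token set (pooled over space), and a spatial-temporal token set (no pooling).

```python
from statistics import fmean


def view_tokens(video):
    """Split a feature volume into three hypothetical token sets.

    video: list of T frames, each an H x W grid (list of lists) of floats.
    Average pooling is an illustrative assumption, not the paper's method.
    """
    T, H, W = len(video), len(video[0]), len(video[0][0])

    # Spatial view: average over time, one token per location (H*W tokens).
    spatial = [
        fmean(video[t][h][w] for t in range(T))
        for h in range(H)
        for w in range(W)
    ]

    # Temporal view: average over space, one token per frame (T tokens).
    temporal = [
        fmean(video[t][h][w] for h in range(H) for w in range(W))
        for t in range(T)
    ]

    # Spatial-temporal view: keep every location of every frame (T*H*W tokens).
    spatio_temporal = [
        video[t][h][w]
        for t in range(T)
        for h in range(H)
        for w in range(W)
    ]
    return spatial, temporal, spatio_temporal


# Toy input: 2 frames of 2 x 2 scalar features.
video = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[5.0, 6.0], [7.0, 8.0]],
]
spatial, temporal, spatio_temporal = view_tokens(video)
```

In a framework like the one described, each token set would then pass through its own self-attention module before a cross-view module fuses the three; those stages are omitted here.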