Visible-Thermal Multiple Object Tracking: Large-scale Video Dataset and Progressive Fusion Approach

Yabin Zhu, Qianwu Wang, Chenglong Li, Jin Tang, Zhixiang Huang
2024-08-02
Abstract: The complementary benefits of visible and thermal infrared data are widely exploited in various computer vision tasks, such as visual tracking, semantic segmentation, and object detection, but are rarely explored in Multiple Object Tracking (MOT). In this work, we contribute a large-scale Visible-Thermal video benchmark for MOT, called VT-MOT. VT-MOT has the following main advantages. 1) The data is large-scale and highly diverse. VT-MOT includes 582 video sequence pairs and 401k frame pairs from surveillance, drone, and handheld platforms. 2) The cross-modal alignment is highly accurate. We invited several professionals to perform both spatial and temporal alignment frame by frame. 3) The annotation is dense and high-quality. VT-MOT has 3.99 million annotation boxes, annotated and double-checked by professionals, covering challenges such as heavy occlusion and object re-acquisition (objects disappearing and reappearing). To provide a strong baseline, we design a simple yet effective tracking framework that progressively fuses temporal information and the complementary information of the two modalities for robust visible-thermal MOT. Comprehensive experiments are conducted on VT-MOT, and the results demonstrate the superiority and effectiveness of the proposed method compared with state-of-the-art methods. From the evaluation results and analysis, we identify several potential future directions for visible-thermal MOT. The project is released at https://github.com/wqw123wqw/PFTrack.
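To give a feel for the scale figures quoted in the abstract, the short sketch below derives rough per-sequence and per-frame averages from them. The inputs are the rounded numbers from the abstract ("401k", "3.99 million"), so the outputs are approximations only; the function and constant names are illustrative and do not come from the paper or its code release.

```python
# Rounded dataset figures as quoted in the abstract (not exact counts).
NUM_SEQUENCE_PAIRS = 582
NUM_FRAME_PAIRS = 401_000   # "401k frame pairs"
NUM_BOXES = 3_990_000       # "3.99 million annotation boxes"


def dataset_averages(sequences: int, frames: int, boxes: int) -> dict:
    """Return rough averages derived from the headline dataset figures."""
    return {
        "frames_per_sequence": frames / sequences,
        "boxes_per_frame_pair": boxes / frames,
        "boxes_per_sequence": boxes / sequences,
    }


averages = dataset_averages(NUM_SEQUENCE_PAIRS, NUM_FRAME_PAIRS, NUM_BOXES)
for name, value in averages.items():
    print(f"{name}: {value:.1f}")
```

Under these rounded inputs, a sequence pair averages a little under 700 frame pairs, and a frame pair averages roughly ten annotated boxes.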
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper primarily aims to address the challenges of multi-object tracking (MOT) in complex environments, particularly under conditions such as low light and haze. Specifically:

1. **Dataset Construction**:
   - Constructed a large-scale visible-thermal infrared video benchmark dataset (VT-MOT) for multi-object tracking tasks.
   - The dataset includes 582 video sequence pairs, totaling 401k frame pairs, sourced from three platforms: drones, surveillance cameras, and handheld devices.
   - The dataset features high-precision temporal and spatial alignment and provides dense, high-quality annotations, including 3.99 million annotation boxes.
2. **Fusion Method Design**:
   - Proposed a novel progressive fusion tracking framework named PFTrack, which integrates temporal information and complementary information from two modalities (visible light and thermal infrared) to enhance target feature representation.
   - Through a two-stage fusion module (PFM), comprising temporal feature fusion and multi-modal feature fusion, it exploits both multi-modal and temporal information to improve tracking performance.
3. **Experimental Validation**:
   - Conducted extensive experiments on the VT-MOT dataset, demonstrating the advantages and effectiveness of the proposed method compared to existing methods, and pointed out future research directions.

Through these efforts, the paper aims to advance the research and development of multi-object tracking under all-weather and all-time conditions.
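The two-stage idea described above (temporal fusion, then multi-modal fusion) can be illustrated with a toy sketch. This is not the PFTrack implementation: the summary does not specify the PFM internals, so the fixed blending weights below stand in for whatever learned fusion the paper uses, the stage order is an assumption drawn from the order in which the summary lists the stages, and every name (`weighted_sum`, `progressive_fusion`, `temporal_weight`, `visible_weight`) is hypothetical.

```python
from typing import List

Vector = List[float]


def weighted_sum(a: Vector, b: Vector, weight_a: float) -> Vector:
    """Blend two equal-length feature vectors: weight_a * a + (1 - weight_a) * b."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return [weight_a * x + (1.0 - weight_a) * y for x, y in zip(a, b)]


def progressive_fusion(
    vis_prev: Vector,
    vis_curr: Vector,
    th_prev: Vector,
    th_curr: Vector,
    temporal_weight: float = 0.5,  # hypothetical weight on the current frame
    visible_weight: float = 0.5,   # hypothetical weight on the visible modality
) -> Vector:
    """Toy two-stage fusion with fixed weights (a stand-in for learned modules)."""
    # Stage 1 (assumed first): temporal fusion within each modality.
    vis_temporal = weighted_sum(vis_curr, vis_prev, temporal_weight)
    th_temporal = weighted_sum(th_curr, th_prev, temporal_weight)
    # Stage 2: multi-modal fusion of the temporally fused features.
    return weighted_sum(vis_temporal, th_temporal, visible_weight)


fused = progressive_fusion(
    vis_prev=[0.0, 2.0],
    vis_curr=[1.0, 2.0],
    th_prev=[4.0, 0.0],
    th_curr=[4.0, 1.0],
)
print(fused)  # [2.25, 1.25]
```

Fusing in stages keeps each step simple: each modality is first smoothed over time on its own, and only the resulting features are combined across modalities. In a real tracker both stages would operate on learned feature maps rather than hand-set weights.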