An End-to-end Tracking Framework Via Multi-View and Temporal Feature Aggregation

Yihan Yang, Ming Xu, Jason F. Ralph, Yuchen Ling, Xiaonan Pan
DOI: https://doi.org/10.1016/j.cviu.2024.104203
2024-01-01
Abstract: Multi-view pedestrian tracking has frequently been used to cope with the challenges of occlusion and limited fields-of-view in single-view tracking. However, there are few end-to-end methods in this field. Many existing algorithms detect pedestrians in individual views, cluster the projected detections in a top view and then track them. Others track pedestrians in individual views and then associate the projected tracklets in a top view. In this paper, an end-to-end framework is proposed for multi-view tracking, in which both multi-view and temporal aggregations of feature maps are applied. The multi-view aggregation projects the per-view feature maps to a top view, uses a transformer encoder to output encoded feature maps and then uses a CNN to calculate a pedestrian occupancy map. The temporal aggregation uses another CNN to estimate position offsets from the encoded feature maps in consecutive frames. Our experiments demonstrate that this end-to-end framework outperforms the state-of-the-art online algorithms for multi-view pedestrian tracking.
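To illustrate how estimated position offsets could link detections across consecutive frames, the following is a minimal sketch and not the authors' implementation. The abstract does not say how offsets are turned into tracks, so several things here are assumptions: that each offset points from a pedestrian's current top-view position back to its position in the previous frame, that matching is greedy nearest-neighbour, and that a distance gate `max_dist` rejects implausible links. The function name `associate_by_offset` and all parameter names are hypothetical.

```python
import numpy as np


def associate_by_offset(prev_positions, curr_positions, offsets, max_dist=2.0):
    """Greedily match current top-view detections to previous ones.

    prev_positions: (M, 2) positions at frame t-1 (assumed grid units)
    curr_positions: (N, 2) positions at frame t
    offsets:        (N, 2) assumed displacement from frame t back to t-1
    Returns a list of length N holding the matched previous index, or -1.
    """
    prev_positions = np.asarray(prev_positions, dtype=float).reshape(-1, 2)
    curr_positions = np.asarray(curr_positions, dtype=float).reshape(-1, 2)
    offsets = np.asarray(offsets, dtype=float).reshape(-1, 2)

    matches = [-1] * len(curr_positions)
    if len(prev_positions) == 0 or len(curr_positions) == 0:
        return matches

    # Move each current detection back to where it is predicted to have been.
    back_projected = curr_positions + offsets

    # Pairwise Euclidean distances, shape (N, M).
    diffs = back_projected[:, None, :] - prev_positions[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)

    # Visit candidate pairs from closest to farthest (greedy assumption).
    used_prev = set()
    for flat in np.argsort(dists, axis=None):
        i, j = np.unravel_index(flat, dists.shape)
        i, j = int(i), int(j)
        if dists[i, j] > max_dist:
            break  # all remaining pairs are farther than the gate
        if matches[i] == -1 and j not in used_prev:
            matches[i] = j
            used_prev.add(j)
    return matches


if __name__ == "__main__":
    prev = [[0.0, 0.0], [10.0, 10.0]]
    curr = [[1.0, 0.0], [10.0, 11.0], [50.0, 50.0]]
    offs = [[-1.0, 0.0], [0.0, -1.0], [0.0, 0.0]]
    print(associate_by_offset(prev, curr, offs))
```

In this sketch an unmatched detection (index -1) would start a new track. A globally optimal assignment could replace the greedy loop, at the cost of extra complexity.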