A multimodal stepwise-coordinating framework for pedestrian trajectory prediction

Yijun Wang, Zekun Guo, Chang Xu, Jianxin Lin
DOI: https://doi.org/10.1016/j.knosys.2024.112038
IF: 8.139
2024-06-03
Knowledge-Based Systems
Abstract: Pedestrian trajectory prediction from the first-person view is still considered one of the challenging problems in autonomous driving because pedestrian actions are difficult to understand and predict. Observing that pedestrian motion naturally contains the repetitive pattern of the gait cycle as well as global intention information, we design a Multimodal Stepwise-Coordinating Network (MSCN) to fully leverage these underlying properties of human motion. Specifically, we first design a multimodal spatial-frequency encoder, which encodes the periodicity of pedestrian motion with a frequency-domain enhanced Transformer and the other visual information with a spatial-domain Transformer. We then propose a stepwise-coordinating decoder, which leverages both local and global information through a two-stage decoding process. In the first stage, a stepwise trajectory predictor generates a coarse sequence. In the second stage, a coordinator aggregates the representations used to generate that coarse sequence and, through a knowledge distillation process, learns to output a refined sequence from the aggregated representations. In this way, MSCN can adequately capture representations of short-term motion behaviors and thereby improve long-term sequence prediction. Extensive experiments show that the proposed model achieves significant improvements over state-of-the-art approaches, by 16.1% on the PIE dataset and 16.4% on the JAAD dataset.
computer science, artificial intelligence