Abstract: To model the indeterminacy of human behaviors, stochastic trajectory prediction requires a sophisticated multi-modal distribution of future trajectories. Emerging diffusion models have revealed their tremendous representation capacities in numerous generation tasks, showing potential for stochastic trajectory prediction. However, expensive time consumption prevents diffusion models from real-time prediction, since a large number of denoising steps are required to assure sufficient representation ability. To resolve the dilemma, we present LEapfrog Diffusion model (LED), a novel diffusion-based trajectory prediction model, which provides real-time, precise, and diverse predictions. The core of the proposed LED is to leverage a trainable leapfrog initializer to directly learn an expressive multi-modal distribution of future trajectories, which skips a large number of denoising steps, significantly accelerating inference speed. Moreover, the leapfrog initializer is trained to appropriately allocate correlated samples to provide a diversity of predicted future trajectories, significantly improving prediction performances. Extensive experiments on four real-world datasets, including NBA/NFL/SDD/ETH-UCY, show that LED consistently improves performance and achieves 23.7%/21.9% ADE/FDE improvement on NFL. The proposed LED also speeds up the inference 19.3/30.8/24.3/25.1 times compared to the standard diffusion model on NBA/NFL/SDD/ETH-UCY, satisfying real-time inference needs. Code is available at https://github.com/MediaBrain-SJTU/LED.
What problem does this paper attempt to address?
The paper primarily addresses two key issues when using diffusion models for stochastic trajectory prediction in real-time applications:
1. **Real-time inference takes too long**: Standard diffusion models need a large number of denoising steps to retain enough representational capacity for high-quality samples, and every step adds computation time. On the NBA dataset, for example, a diffusion model needs approximately 100 denoising steps to reach decent prediction performance, so a single prediction takes about 886 milliseconds, while the next frame of data arrives every 200 milliseconds.
2. **Independent and identically distributed (i.i.d.) samples may not capture enough modes**: With a limited sample budget, i.i.d. samples drawn from the generative model may fail to cover enough modes of the underlying distribution. Empirically, a few independently sampled trajectories can miss important future possibilities, and without proper sample allocation this significantly reduces prediction performance.
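To give a rough sense of the second issue, the sketch below computes how likely it is that K i.i.d. samples touch every mode. It rests on a simplifying assumption that is not from the paper: the future distribution has M equally likely, well-separated modes. The function name `prob_all_modes_covered` and the mode and sample counts are illustrative only.

```python
from math import comb


def prob_all_modes_covered(num_modes: int, num_samples: int) -> float:
    """Probability that i.i.d. samples hit every one of `num_modes` modes.

    Assumes equally likely modes (a toy assumption). Uses inclusion-exclusion
    over the number j of modes that are missed entirely.
    """
    return sum(
        (-1) ** j * comb(num_modes, j) * (1 - j / num_modes) ** num_samples
        for j in range(num_modes + 1)
    )


if __name__ == "__main__":
    # Hypothetical setting: 6 equally likely futures, varying sample budgets.
    for k in (6, 12, 20):
        print(k, round(prob_all_modes_covered(6, k), 3))
```

In this toy setting, a sample budget equal to the number of modes covers them all only if every draw lands in a different mode, which is unlikely for independent draws. That is one way to read the motivation for allocating samples jointly rather than independently.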
To address these issues, the authors propose the LEapfrog Diffusion model (LED), a novel denoising diffusion-based stochastic trajectory prediction model that significantly accelerates inference and adaptively allocates multiple correlated predictions to provide prediction diversity.
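The control flow behind this idea can be pictured with a minimal sketch. This is a structural illustration under stated assumptions, not the authors' implementation: `toy_denoise_step` and `toy_initializer` are hypothetical stand-ins for learned networks, trajectories are reduced to 1-D lists, and the choice of 5 remaining steps is arbitrary. Only the figure of roughly 100 steps for the standard model comes from the text above.

```python
import random
from typing import Callable, List, Tuple

Trajectory = List[float]  # toy 1-D positions over the prediction horizon
DenoiseStep = Callable[[Trajectory, int], Trajectory]
Initializer = Callable[[int, int], List[Trajectory]]


def toy_denoise_step(y: Trajectory, t: int) -> Trajectory:
    """Stand-in for one learned denoising step (hypothetical): shrink toward 0."""
    return [0.9 * v for v in y]


def toy_initializer(num_samples: int, horizon: int) -> List[Trajectory]:
    """Stand-in for a trained initializer (hypothetical).

    Emits all K samples in one call, spread evenly instead of drawn independently.
    """
    center = (num_samples - 1) / 2
    return [[0.5 * (k - center)] * horizon for k in range(num_samples)]


def run_steps(
    samples: List[Trajectory], step: DenoiseStep, num_steps: int
) -> Tuple[List[Trajectory], int]:
    """Apply `step` for t = num_steps, ..., 1 and count the denoising passes."""
    passes = 0
    for t in range(num_steps, 0, -1):
        samples = [step(y, t) for y in samples]
        passes += 1
    return samples, passes


def standard_sampling(
    step: DenoiseStep, total_steps: int, num_samples: int, horizon: int,
    rng: random.Random,
) -> Tuple[List[Trajectory], int]:
    """Baseline: start each sample from independent Gaussian noise, run all steps."""
    noise = [[rng.gauss(0.0, 1.0) for _ in range(horizon)] for _ in range(num_samples)]
    return run_steps(noise, step, total_steps)


def leapfrog_sampling(
    step: DenoiseStep, initializer: Initializer, remaining_steps: int,
    num_samples: int, horizon: int,
) -> Tuple[List[Trajectory], int]:
    """Leapfrog-style: the initializer supplies the starting samples jointly,
    so only the last `remaining_steps` denoising steps are run."""
    start = initializer(num_samples, horizon)
    return run_steps(start, step, remaining_steps)


if __name__ == "__main__":
    rng = random.Random(0)
    _, full = standard_sampling(toy_denoise_step, 100, 20, 12, rng)
    _, short = leapfrog_sampling(toy_denoise_step, toy_initializer, 5, 20, 12)
    print(f"denoising passes: standard={full}, leapfrog={short}")
```

The two samplers share the same denoising loop and differ only in where they start. In this sketch, the saving comes entirely from replacing the early steps with a single initializer call.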
### Main Contributions
1. **Proposed a new LEapfrog Diffusion model (LED)**, a denoising diffusion-based stochastic trajectory prediction model. LED achieves accurate and diverse predictions with fast inference speed.
2. **Introduced a new trainable "leapfrog" initializer** that directly models complex denoising distributions, which accelerates inference, and adaptively allocates samples for diversity, which improves prediction performance.
3. **Conducted extensive experiments on four datasets**: NBA, NFL, the Stanford Drone Dataset (SDD), and ETH-UCY. The results show that the proposed method achieves state-of-the-art performance on all four datasets compared to previous methods. Compared to the standard diffusion model, it speeds up inference by roughly 19 to 31 times depending on the dataset (19.3/30.8/24.3/25.1 times on NBA/NFL/SDD/ETH-UCY), meeting the needs of real-time prediction.
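The real-time claim can be sanity-checked with back-of-the-envelope arithmetic on the NBA figures quoted earlier (about 886 ms for roughly 100 denoising steps, a new frame every 200 ms, and a 19.3 times speedup). The sketch below makes two assumptions that are not stated in the source: inference cost scales linearly with the number of denoising steps with no fixed overhead, and the 19.3 times speedup is measured against the 886 ms baseline. All function names are illustrative.

```python
def per_step_ms(total_ms: float, steps: int) -> float:
    """Average cost of one denoising step, assuming linear scaling in steps."""
    return total_ms / steps


def max_steps_within(budget_ms: float, step_ms: float) -> int:
    """Largest whole number of denoising steps that fits in the time budget."""
    return int(budget_ms // step_ms)


def latency_after_speedup(baseline_ms: float, speedup: float) -> float:
    """Latency implied by dividing the baseline by the reported speedup."""
    return baseline_ms / speedup


if __name__ == "__main__":
    step_ms = per_step_ms(886.0, 100)           # NBA figures quoted in the text
    print("ms per step:", step_ms)
    print("steps that fit in 200 ms:", max_steps_within(200.0, step_ms))
    print("implied NBA latency (ms):", latency_after_speedup(886.0, 19.3))
```

Under these assumptions, the standard model would have to cut its step count to roughly a fifth to fit the frame interval, while the implied latency after the reported speedup sits well below 200 ms.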
Through these contributions, the paper demonstrates how to effectively leverage the advantages of diffusion models while overcoming their limitations in real-time applications, particularly in complex human behavior prediction tasks.