Hyper-STTN: Social Group-aware Spatial-Temporal Transformer Network for Human Trajectory Prediction with Hypergraph Reasoning

Weizheng Wang, Chaowei Wang, Baijian Yang, Guohua Chen, Byung-Cheol Min
2024-09-18
Abstract: Predicting crowd intents and trajectories is crucial in various real-world applications, including service robots and autonomous vehicles. Understanding environmental dynamics is challenging, not only due to the complexities of modeling pair-wise spatial and temporal interactions but also due to the diverse influence of group-wise interactions. To decode the comprehensive pair-wise and group-wise interactions in crowded scenarios, we introduce Hyper-STTN, a Hypergraph-based Spatial-Temporal Transformer Network for crowd trajectory prediction. In Hyper-STTN, crowded group-wise correlations are constructed using a set of multi-scale hypergraphs with varying group sizes, captured through random-walk probability-based hypergraph spectral convolution. Additionally, a spatial-temporal transformer is adapted to capture pedestrians' pair-wise latent interactions in spatial-temporal dimensions. These heterogeneous group-wise and pair-wise features are then fused and aligned through a multimodal transformer network. Hyper-STTN outperforms other state-of-the-art baselines and ablation models on 5 real-world pedestrian motion datasets.
Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
The paper addresses the problem of modeling complex social interactions in crowd trajectory prediction, in particular predicting the intentions and trajectories of people in crowded social environments. Specifically, it targets the following key issues:

1. **Understanding Environmental Dynamics**: In crowded scenarios, it is essential to model not only the spatial and temporal interactions between individuals but also the complex influences between groups. This is crucial for understanding and predicting human behavior.
2. **High-Order Interaction Description**: Existing methods often lack effective descriptions of high-order interactions (such as interactions between groups) and the ability to reason about heterogeneous features.
3. **Subjective Intention Prediction**: Accurately predicting an individual's subjective intentions from limited information remains challenging, especially in highly dynamic or complex scenarios.

To address these issues, the authors propose a new framework called Hyper-STTN, which combines multi-scale hypergraphs with a spatial-temporal transformer network to capture both pair-wise interactions between individuals and group-wise interactions. A multimodal transformer network then fuses these heterogeneous features. Experimental results show that Hyper-STTN outperforms existing state-of-the-art algorithms on multiple public crowd trajectory datasets.
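
The abstract mentions a random-walk probability-based hypergraph spectral convolution but does not spell it out here. As a rough illustration only, the sketch below assumes the textbook hypergraph random-walk transition matrix P = D_v^{-1} H W D_e^{-1} H^T, where H is the pedestrian-by-group incidence matrix, W holds hyperedge weights, and D_v and D_e are the vertex and hyperedge degree matrices. The function names, the toy groups, and the omission of learned weights are all assumptions made for illustration; this is not the authors' implementation.

```python
def random_walk_transition(incidence, edge_weights=None):
    """Return P = Dv^-1 H W De^-1 H^T as a list of rows.

    incidence[v][e] is 1 if pedestrian v belongs to group (hyperedge) e, else 0.
    """
    n_vertices = len(incidence)
    n_edges = len(incidence[0])
    if edge_weights is None:
        edge_weights = [1.0] * n_edges

    # Hyperedge degree: number of pedestrians in each group.
    edge_degree = [sum(incidence[v][e] for v in range(n_vertices))
                   for e in range(n_edges)]
    # Vertex degree: total weight of the groups a pedestrian belongs to.
    vertex_degree = [sum(edge_weights[e] * incidence[v][e] for e in range(n_edges))
                     for v in range(n_vertices)]

    transition = [[0.0] * n_vertices for _ in range(n_vertices)]
    for u in range(n_vertices):
        if vertex_degree[u] == 0:
            # Choice made for this sketch: an ungrouped pedestrian keeps its own state.
            transition[u][u] = 1.0
            continue
        for v in range(n_vertices):
            shared = 0.0
            for e in range(n_edges):
                if incidence[u][e] and incidence[v][e]:
                    shared += edge_weights[e] / edge_degree[e]
            transition[u][v] = shared / vertex_degree[u]
    return transition


def propagate(transition, features):
    """One parameter-free propagation step: returns P @ X."""
    n = len(features)
    dim = len(features[0])
    return [[sum(transition[u][v] * features[v][k] for v in range(n))
             for k in range(dim)]
            for u in range(len(transition))]


# Toy scene (hypothetical): 4 pedestrians, group 0 = {0, 1, 2}, group 1 = {2, 3}.
H = [[1, 0],
     [1, 0],
     [1, 1],
     [0, 1]]
# Hypothetical 2-D velocity features, one row per pedestrian.
X = [[1.0, 0.0],
     [1.0, 0.0],
     [0.0, 1.0],
     [0.0, 1.0]]

P = random_walk_transition(H)
X_smoothed = propagate(P, X)
```

Each row of `P` sums to 1, so `propagate` replaces a pedestrian's features with a weighted average over the pedestrians it shares a group with. A learned layer would typically multiply the result by a trainable weight matrix and apply a nonlinearity, and a multi-scale variant would repeat this with one incidence matrix per group size.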