Abstract: Multi-person motion prediction remains challenging because of intricate motion dynamics and complex interpersonal interactions, with uncertainty escalating rapidly across the forecasting horizon. Existing approaches typically do not model the motion dynamics among the predicted frames to reduce this uncertainty, leaving the task entirely to deep neural networks that lack a dynamic inductive bias, which leads to suboptimal performance. This paper addresses the limitation by proposing an effective multi-person motion prediction method named Hybrid Supervision Transformer (HSFormer), which formulates dynamic modeling within the prediction horizon as a novel hybrid supervision task. Specifically, our method performs a rolling prediction process equipped with a hybrid supervision mechanism, which forces the model to predict the poses in subsequent frames from its own earlier predictions, which typically contain errors. In addition to the standard supervision loss, we introduce two further mechanisms. Self-supervision minimizes the distance between predictions made from error-contained inputs and predictions made from error-free inputs (ground truth), improving the robustness of the model to input deviation at inference. Auxiliary supervision guides the model to make accurate predictions from the ground truth, stabilizing the training process. Optimization techniques such as stop-gradient are extended to our model to improve training efficiency. Furthermore, we develop a fine-grained spatio-temporal correlation capture module to assist feature learning and reduce the uncertainties arising from the intricate and varying interactions among individuals. Our approach achieves state-of-the-art results on multiple multi-person datasets in both short- and long-term prediction.
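The three loss terms named in the abstract can be illustrated with a toy rolling predictor. The following is only a minimal sketch under stated assumptions, not the HSFormer implementation: the function names (`hybrid_losses`, `damped_velocity_step`), the two-frame input window, the mean-squared-error distance, and the unweighted averaging are all hypothetical choices made for illustration. The abstract does not say where stop-gradient is applied, so it appears only as a comment.

```python
from typing import Callable, List, Sequence, Tuple

Pose = List[float]
StepFn = Callable[[Pose, Pose], Pose]


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two poses of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def damped_velocity_step(alpha: float) -> StepFn:
    """Hypothetical toy predictor: next = cur + alpha * (cur - prev)."""
    def step(prev: Pose, cur: Pose) -> Pose:
        return [c + alpha * (c - p) for p, c in zip(prev, cur)]
    return step


def hybrid_losses(
    step: StepFn, history: Sequence[Pose], targets: Sequence[Pose]
) -> Tuple[float, float, float]:
    """Return (standard, self, auxiliary) losses averaged over the horizon.

    `history` must hold at least two observed frames.
    """
    rolled = [list(p) for p in history]  # window extended with own predictions
    clean = [list(p) for p in history]   # window extended with ground truth
    sup = self_sup = aux = 0.0
    for target in targets:
        pred_rolled = step(rolled[-2], rolled[-1])  # error-contained inputs
        pred_clean = step(clean[-2], clean[-1])     # error-free inputs
        sup += mse(pred_rolled, target)             # standard supervision
        # Self-supervision: pull the rolled prediction toward the clean one.
        # In an autodiff framework one branch could be wrapped in a
        # stop-gradient here (an assumption; the abstract gives no details).
        self_sup += mse(pred_rolled, pred_clean)
        aux += mse(pred_clean, target)              # auxiliary supervision
        rolled.append(pred_rolled)
        clean.append(list(target))
    n = len(targets)
    return sup / n, self_sup / n, aux / n


if __name__ == "__main__":
    history = [[0.0], [1.0]]   # one-dimensional "poses" moving at unit speed
    targets = [[2.0], [3.0]]
    print(hybrid_losses(damped_velocity_step(0.5), history, targets))
```

In this toy setting the self-supervision term is zero at the first predicted frame, because both branches still see the same observed history, and grows once the rolled window starts to contain the model's own errors. That growth is the input deviation the abstract says the self-supervision mechanism is meant to counteract.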

Multiple Futures Prediction

Probabilistic Future Prediction for Video Scene Understanding

Multiple Future Prediction Leveraging Synthetic Trajectories

Multi‐future Transformer: Learning Diverse Interaction Modes for Behaviour Prediction in Autonomous Driving

Robust Trajectory Forecasting for Multiple Intelligent Agents in Dynamic Scene

Enhanced Multimodal Trajectory Prediction for Autonomous Vehicles Using Advanced Diffusion Model Techniques

Multi-Agent Tensor Fusion for Contextual Trajectory Prediction

Enhanced Prediction of Multi-Agent Trajectories via Control Inference and State-Space Dynamics

Simultaneous Past and Current Social Interaction-aware Trajectory Prediction for Multiple Intelligent Agents in Dynamic Scenes

MultiPath++: Efficient Information Fusion and Trajectory Aggregation for Behavior Prediction

A multi-modal vehicle trajectory prediction framework via conditional diffusion model: A coarse-to-fine approach

Look Before You Drive: Boosting Trajectory Forecasting via Imagining Future

Multi-agent Trajectory Prediction with Fuzzy Query Attention

SceneMotion: From Agent-Centric Embeddings to Scene-Wide Forecasts

A multi-modal spatial–temporal model for accurate motion forecasting with visual fusion

Multi-Vehicle Collaborative Learning for Trajectory Prediction With Spatio-Temporal Tensor Fusion

Motion Forecasting in Continuous Driving

Multimodal Trajectory Prediction for Diverse Vehicle Types in Autonomous Driving with Heterogeneous Data and Physical Constraints

Future Motion Dynamic Modeling Via Hybrid Supervision for Multi-Person Motion Prediction Uncertainty Reduction

Motion Forecasting via Model-Based Risk Minimization