Abstract: Predicting the behavior of road users accurately is crucial to enable the safe operation of autonomous vehicles in urban or densely populated areas. Therefore, there has been a growing interest in time series motion prediction research, leading to significant advancements in state-of-the-art techniques in recent years. However, the potential of using LiDAR data to capture more detailed local features, such as a person's gaze or posture, remains largely unexplored. To address this, we develop a novel multimodal approach for motion prediction based on the PointNet foundation model architecture, incorporating local LiDAR features. Evaluation on the Waymo Open Dataset shows a performance improvement of 6.20% and 1.58% in minADE and mAP respectively, when integrated and compared with the previous state-of-the-art MTR. We open-source the code of our LiMTR model.

What problem does this paper attempt to address?

This paper aims to improve how accurately autonomous vehicles predict the behavior of different road users (pedestrians, cyclists and vehicles) in urban or densely populated areas. Specifically, the authors extend a time-series motion prediction model with LiDAR data so that it can capture more detailed local features, such as a person's gaze or posture.

### Problem Background

Current time-series motion prediction relies mainly on coarse-grained inputs: the target's position, speed, acceleration and bounding box as output by the object detection step, combined with accurate road information. This representation is efficient, but it can miss fine-grained cues about the target, such as the gaze direction of a pedestrian or the posture of a cyclist. Such cues are especially important for predicting the behavior of vulnerable road users (pedestrians and cyclists).

### Solution

To address this, the authors propose LiMTR (LiDAR Motion Transformer), a new multimodal method based on the PointNet architecture that integrates LiDAR data directly into the motion prediction model. Its main points are:

1. **Local LiDAR Feature Extraction**: Only the subset of the LiDAR point cloud that belongs to the target road user is used, which helps the model focus on target-specific features (such as a person's posture or gaze direction).
2. **LiDAR Encoder Design**: The LiDAR encoder is based on the PointNet architecture and processes the point cloud directly, without voxelization or other complex pre-processing steps.
3. **Performance Improvement**: Experiments on the Waymo Open Dataset show that LiMTR improves minimum average displacement error (minADE) by 6.20% and mean average precision (mAP) by 1.58%, with the gains most visible for vulnerable road users (pedestrians and cyclists).

### Conclusion

By introducing local LiDAR features, LiMTR significantly improves the prediction accuracy of the future trajectories of different road users, especially vulnerable road users such as pedestrians and cyclists. This enables autonomous vehicles to operate more safely in complex urban environments.

### Formula Representation

- Minimum Average Displacement Error (minADE):

  \[ \text{minADE}=\min_{1 \le i \le m}\frac{1}{T}\sum_{t = 1}^{T}\left\|\mathbf{x}_t^{\text{pred},i}-\mathbf{x}_t^{\text{gt}}\right\|_2 \]

  where \(\mathbf{x}_t^{\text{pred},i}\) is the position of the \(i\)-th predicted trajectory at time \(t\), \(\mathbf{x}_t^{\text{gt}}\) is the position of the ground-truth trajectory at time \(t\), \(T\) is the number of time steps, and \(m\) is the number of predicted trajectories.

- Mean Average Precision (mAP):

  \[ \text{mAP}=\frac{1}{C}\sum_{c = 1}^{C}\text{AP}_c \]

  where \(C\) is the number of road user categories (such as pedestrians, cyclists and vehicles), and \(\text{AP}_c\) is the average precision of the \(c\)-th category.

Through these improvements, LiMTR provides more reliable support for the safe operation of autonomous vehicles.
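The two metric definitions above can be illustrated with a minimal Python sketch. It is not the Waymo Open Dataset evaluation code or the LiMTR implementation: the function names `min_ade` and `mean_ap`, the 2D point representation, and the toy numbers are all assumptions made for illustration, and the per-category AP values are taken as given rather than computed.

```python
import math
from typing import Dict, Sequence, Tuple

Point = Tuple[float, float]  # assumed 2D (x, y) position


def min_ade(predictions: Sequence[Sequence[Point]],
            ground_truth: Sequence[Point]) -> float:
    """Smallest average displacement error over m predicted trajectories.

    Each trajectory must have the same number of time steps T as the
    ground truth (hypothetical helper, mirrors the minADE formula).
    """
    if not predictions or not ground_truth:
        raise ValueError("need at least one prediction and one time step")
    horizon = len(ground_truth)
    ades = []
    for trajectory in predictions:
        if len(trajectory) != horizon:
            raise ValueError("prediction and ground truth lengths differ")
        # Sum of Euclidean (L2) distances over time, divided by T.
        total = sum(math.dist(p, g) for p, g in zip(trajectory, ground_truth))
        ades.append(total / horizon)
    # Take the best (minimum) ADE among the m candidates.
    return min(ades)


def mean_ap(ap_per_category: Dict[str, float]) -> float:
    """Unweighted mean of per-category average precision values."""
    if not ap_per_category:
        raise ValueError("need at least one category")
    return sum(ap_per_category.values()) / len(ap_per_category)


# Toy example: two candidate trajectories for one agent over three steps.
gt = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
candidates = [
    [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)],  # 1 m off at every step
    [(3.0, 4.0), (4.0, 4.0), (5.0, 4.0)],  # 5 m off at every step
]
best = min_ade(candidates, gt)
overall = mean_ap({"vehicle": 0.5, "pedestrian": 0.25, "cyclist": 0.75})
```

In the toy example the first candidate is the closer one, so it alone determines minADE. This is why the metric rewards a model for producing at least one good hypothesis among its `m` predictions.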

LiMTR: Time Series Motion Prediction for Diverse Road Users through Multimodal Feature Integration

Enhanced Multimodal Trajectory Prediction for Autonomous Vehicles Using Advanced Diffusion Model Techniques

Large Language Models Powered Context-aware Motion Prediction in Autonomous Driving

MGTR: Multi-Granular Transformer for Motion Prediction with LiDAR

Map-Adaptive Multimodal Trajectory Prediction via Intention-Aware Unimodal Trajectory Predictors

ControlMTR: Control-Guided Motion Transformer with Scene-Compliant Intention Points for Feasible Motion Prediction

MotionLM: Multi-Agent Motion Forecasting As Language Modeling

Towards Practical Human Motion Prediction with LiDAR Point Clouds

Vehicle Motion State Prediction Method Integrating Point Cloud Time Series Multiview Features and Multitarget Interactive Information

Traj-LLM: A New Exploration for Empowering Trajectory Prediction with Pre-trained Large Language Models

Learning-enabled multi-modal motion prediction in urban environments

BE-STI: Spatial-Temporal Integrated Network for Class-agnostic Motion Prediction with Bidirectional Enhancement

TPNet: Trajectory Proposal Network for Motion Prediction

Multi-Modal Vehicle Trajectory Prediction by Collaborative Learning of Lane Orientation, Vehicle Interaction, and Intention

A Road-Aware Neural Network For Multi-Step Vehicle Trajectory Prediction

Multimodal Trajectory Prediction for Autonomous Driving on Unstructured Roads using Deep Convolutional Network

An Improved Multimodal Trajectory Prediction Method Based on Deep Inverse Reinforcement Learning

Shared Cross-Modal Trajectory Prediction for Autonomous Driving

Context‐aware trajectory prediction for autonomous driving in heterogeneous environments

TrafficPredict: Trajectory Prediction for Heterogeneous Traffic-Agents

ReCoAt: A Deep Learning-based Framework for Multi-Modal Motion Prediction in Autonomous Driving Application