Valeo4Cast: A Modular Approach to End-to-End Forecasting

Yihong Xu, Éloi Zablocki, Alexandre Boulch, Gilles Puy, Mickael Chen, Florent Bartoccioni, Nermin Samet, Oriane Siméoni, Spyros Gidaris, Tuan-Hung Vu, Andrei Bursuc, Eduardo Valle, Renaud Marlet, Matthieu Cord
2024-09-27
Abstract: Motion forecasting is crucial in autonomous driving systems to anticipate the future trajectories of surrounding agents such as pedestrians, vehicles, and traffic signals. In end-to-end forecasting, the model must jointly detect and track from sensor data (cameras or LiDARs) the past trajectories of the different elements of the scene and predict their future locations. We depart from the current trend of tackling this task via end-to-end training from perception to forecasting, and instead use a modular approach. We individually build and train detection, tracking and forecasting modules. We then only use consecutive finetuning steps to integrate the modules better and alleviate compounding errors. We conduct an in-depth study on the finetuning strategies and it reveals that our simple yet effective approach significantly improves performance on the end-to-end forecasting benchmark. Consequently, our solution ranks first in the Argoverse 2 End-to-end Forecasting Challenge, with 63.82 mAPf. We surpass forecasting results by +17.1 points over last year's winner and by +13.3 points over this year's runner-up. This remarkable performance in forecasting can be explained by our modular paradigm, which integrates finetuning strategies and significantly outperforms the end-to-end-trained counterparts. The code, model weights and results are made available at https://github.com/valeoai/valeo4cast.
Computer Vision and Pattern Recognition, Robotics
What problem does this paper attempt to address?
This paper addresses how to forecast more accurately the future trajectories of surrounding agents, such as pedestrians, vehicles and traffic signals, in an autonomous driving system. Specifically, it examines the challenges of the end-to-end motion forecasting task and proposes a modular approach to it.

### Main problems and background of the paper

In an autonomous driving system, motion forecasting is crucial because it lets the vehicle anticipate and respond to a dynamic environment. Traditional end-to-end models predict future trajectories directly from sensor data (such as cameras or LiDAR), but they often suffer from cumulative error: mistakes made in detection and tracking are magnified in the forecasting stage.

### Limitations of existing methods

Although current end-to-end training methods have had some success, they also have problems:

- **Cumulative error**: Errors in the detection and tracking modules are not compensated for, so the forecasts are inaccurate.
- **High resource consumption**: End-to-end training requires a large amount of computing resources.
- **Poor flexibility**: When one module has a problem, the entire system must be retrained.

### Proposed solutions

To address these problems, the paper proposes a modular framework, Valeo4Cast. The framework handles detection, tracking and forecasting as three separate tasks and then integrates them through a finetuning strategy. The steps are as follows:

1. **Independent training**: Train the detection, tracking and forecasting modules separately so that each performs well on its own task.
2. **Finetuning**: Finetune the forecasting module on the outputs of the detection and tracking modules, so that it adapts to the imperfect inputs it receives in practice.
3. **Post-processing**: Apply a post-processing step to keep the forecasts for static objects accurate.

### Key technical points

- **Pre-training**: Pre-train the forecasting module on large-scale data (for example, through the UniTraj framework) to give the model a better initialization.
- **Matching strategy**: Match the predicted results to the ground-truth labels so that finetuning can account for errors made by the detection and tracking modules.
- **Post-processing**: For static objects, insert a static trajectory among the forecasts so that the model is not overly biased towards predicting motion.

### Experimental results

Valeo4Cast ranks first in the Argoverse 2 End-to-end Forecasting Challenge with 63.82 mAPf, which is 17.1 points above last year's winner and 13.3 points above this year's runner-up. This supports the effectiveness of the modular approach, especially when the perception results are imperfect.

### Summary

Through a modular approach, the paper alleviates the cumulative error problem in end-to-end motion forecasting and significantly improves forecasting performance. The modular design is also presented as reducing resource consumption and improving the flexibility and robustness of the system.
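
The matching strategy and the static-object post-processing mentioned above can be illustrated with a small sketch. It is not the authors' implementation. The function names, the greedy nearest-neighbour matching, the 2-metre threshold and the choice to overwrite the last forecast mode are all assumptions made for illustration. The idea is that each predicted track is paired with a nearby ground-truth agent so that the agent's true future can supervise finetuning on imperfect inputs, and that a stay-in-place trajectory is kept among the forecast modes.

```python
from math import hypot
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def match_tracks_to_gt(
    tracks: Dict[str, Point],
    gt_agents: Dict[str, Point],
    max_dist: float = 2.0,  # hypothetical threshold in metres, not from the paper
) -> Dict[str, str]:
    """Greedily pair each predicted track with the closest unused ground-truth agent."""
    # All (distance, track id, ground-truth id) pairs, closest first.
    candidates = sorted(
        (hypot(tp[0] - gp[0], tp[1] - gp[1]), tid, gid)
        for tid, tp in tracks.items()
        for gid, gp in gt_agents.items()
    )
    matches: Dict[str, str] = {}
    used_gt = set()
    for dist, tid, gid in candidates:
        if dist > max_dist:
            break  # every remaining pair is even farther apart
        if tid in matches or gid in used_gt:
            continue  # one-to-one matching only
        matches[tid] = gid
        used_gt.add(gid)
    return matches


def add_static_mode(modes: List[List[Point]], current: Point) -> List[List[Point]]:
    """Overwrite the last mode (assumed least confident) with a stay-in-place trajectory."""
    horizon = len(modes[0])
    return modes[:-1] + [[current] * horizon]


# Current positions of predicted tracks and of annotated agents (toy values).
tracks = {"t1": (0.0, 0.0), "t2": (10.0, 0.0), "t3": (50.0, 50.0)}
gt_agents = {"a": (0.5, 0.0), "b": (10.0, 1.0)}
matches = match_tracks_to_gt(tracks, gt_agents)

# Two forecast modes over a two-step horizon for a track currently at the origin.
modes = [[(1.0, 0.0), (2.0, 0.0)], [(0.5, 0.0), (1.0, 0.0)]]
with_static = add_static_mode(modes, (0.0, 0.0))
```

In this toy example, `t3` has no ground-truth agent within the threshold, so it stays unmatched and would contribute no supervision during finetuning. A Hungarian assignment could replace the greedy loop if a globally optimal matching is preferred.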