Abstract:For autonomous driving, it is imperative to perform various high-computation image recognition tasks with high accuracy, utilizing diverse sensors to perceive the surrounding environment. Specifically, cameras are used to perform lane detection, object detection, and segmentation, and, in the absence of lidar, tasks extend to inferring 3D information through depth estimation, 3D object detection, 3D reconstruction, and SLAM. However, accurately processing all these image recognition operations in real-time for autonomous driving under constrained hardware conditions is practically unfeasible. In this study, considering the characteristics of image recognition tasks performed by these sensors and the given hardware conditions, we investigated MTL (multi-task learning), which enables parallel execution of various image recognition tasks to maximize their processing speed, accuracy, and memory efficiency. Particularly, this study analyzes the combinations of image recognition tasks for autonomous driving and proposes the MDO (multi-task decision and optimization) algorithm, consisting of three steps, as a means for optimization. In the initial step, a MTS (multi-task set) is selected to minimize overall latency while meeting minimum accuracy requirements. Subsequently, additional training of the shared backbone and individual subnets is conducted to enhance accuracy with the predefined MTS. Finally, both the shared backbone and each subnet undergo compression while maintaining the already secured accuracy and latency performance. The experimental results indicate that integrated accuracy performance is critically important in the configuration and optimization of MTL, and this integrated accuracy is determined by the ITC (inter-task correlation). The MDO algorithm was designed to consider these characteristics and construct multi-task sets with tasks that exhibit high ITC. Furthermore, the implementation of the proposed MDO algorithm, coupled with additional SSL (semi-supervised learning) based training, resulted in a significant performance enhancement. This advancement manifested as approximately a 12% increase in object detection mAP performance, a 15% improvement in lane detection accuracy, and a 27% reduction in latency, surpassing the results of previous three-task learning techniques like YOLOP and HybridNet.

Research on Road Scene Understanding of Autonomous Vehicles Based on Multi-Task Learning

Research on Autonomous Driving Image Recognition Based on a New Real-Time Object Detection Model YOLOv5st

Multi-Task Environmental Perception Methods for Autonomous Driving

Multi-Task Visual Perception for Object Detection and Semantic Segmentation in Intelligent Driving

You Only Look at Once for Real-time and Generic Multi-Task

A panoramic driving perception fusion algorithm based on multi-task learning

YOLOMH: you only look once for multi-task driving perception with high efficiency

Research on multitask model of object detection and road segmentation in unstructured road scenes

Research on multi-object detection technology for road scenes based on SDG-YOLOv5

Road Scene Multi-Object Detection Algorithm Based on CMS-YOLO

MCS-YOLO: A Multiscale Object Detection Method for Autonomous Driving Road Environment Recognition

YOLOP: You Only Look Once for Panoptic Driving Perception

Optimal Configuration of Multi-Task Learning for Autonomous Driving

A novel real-time object detection method for complex road scenes based on YOLOv7-tiny

An End-to-End Multi-Task Learning Model for Drivable Road Detection via Edge Refinement and Geometric Deformation

Multi-Task Deep Learning Model for Autonomous Driving: Object Detection, Semantic Segmentation, and Depth Estimation

Q-YOLOP: Quantization-aware You Only Look Once for Panoptic Driving Perception

Joint Semantic Understanding with a Multilevel Branch for Driving Perception

YOLO-U: multi-task model for vehicle detection and road segmentation in UAV aerial imagery

A Multi-Scale Traffic Object Detection Algorithm for Road Scenes Based on Improved YOLOv5

Delving into Multi-modal Multi-task Foundation Models for Road Scene Understanding: From Learning Paradigm Perspectives