NeurAll: Towards a Unified Visual Perception Model for Automated Driving

Ganesh Sistu, Isabelle Leang, Sumanth Chennupati, Senthil Yogamani, Ciaran Hughes, Stefan Milz, Samir Rawashdeh
2024-03-10
Abstract: Convolutional Neural Networks (CNNs) are successfully used for the important automotive visual perception tasks including object recognition, motion and depth estimation, visual SLAM, etc. However, these tasks are typically independently explored and modeled. In this paper, we propose a joint multi-task network design for learning several tasks simultaneously. Our main motivation is the computational efficiency achieved by sharing the expensive initial convolutional layers between all tasks. Indeed, the main bottleneck in automated driving systems is the limited processing power available on deployment hardware. There is also some evidence for other benefits in improving accuracy for some tasks and easing development effort. It also offers scalability to add more tasks leveraging existing features and achieving better generalization. We survey various CNN based solutions for visual perception tasks in automated driving. Then we propose a unified CNN model for the important tasks and discuss several advanced optimization and architecture design techniques to improve the baseline model. The paper is partly review and partly positional with demonstration of several preliminary results promising for future research. We first demonstrate results of multi-stream learning and auxiliary learning which are important ingredients to scale to a large multi-task model. Finally, we implement a two-stream three-task network which performs better in many cases compared to their corresponding single-task models, while maintaining network size.
Computer Vision and Pattern Recognition, Machine Learning, Robotics
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

The paper addresses unified modeling of visual perception tasks in autonomous driving. Specifically, it proposes a joint multi-task network design, named NeurAll, that learns several important visual perception tasks simultaneously. The main motivations are as follows:

1. **Computational Efficiency**: Sharing the expensive initial convolutional layers across all tasks reduces the total computation. This matters because the processing power of deployment hardware is limited in autonomous driving systems.
2. **Accuracy Improvement**: There is some evidence that multi-task learning improves the accuracy of certain tasks and eases development effort.
3. **Scalability**: More tasks can be added by leveraging existing features, which also supports better generalization.

### Main Contributions

1. **Unified Model Design**: The paper proposes a unified CNN model to handle key visual perception tasks in autonomous driving, such as object recognition, motion estimation, depth estimation, and localization.
2. **Multi-Stream Learning**: A multi-stream architecture captures temporal cues by processing consecutive frames, further enhancing model performance.
3. **Auxiliary Learning**: The performance of a primary task (e.g., semantic segmentation) is enhanced by training it together with an auxiliary task (e.g., depth regression).
4. **Experimental Validation**: Experiments demonstrate the performance improvement of the multi-task model on multiple datasets, particularly in video segmentation and semantic segmentation tasks.

### Experimental Results

1. **Multi-Stream Learning**: Compared to the single-stream model, the multi-stream model improves video segmentation performance by 11% and 4% (on KITTI and SYNTHIA validation sets), with only a slight increase in computational complexity.
2. **Auxiliary Learning**: Introducing depth regression as an auxiliary task improves the IoU of the semantic segmentation task by 4% and 3% (on KITTI and SYNTHIA validation sets).
3. **Comparison of Multi-Task and Single-Task Models**: The two-stream three-task unified model performs better than the corresponding single-task models in many cases while maintaining network size.

### Conclusion

By proposing the NeurAll model, the paper demonstrates the potential of multi-task learning for visual perception in autonomous driving. The model improves computational efficiency, enhances accuracy on some tasks, and scales to additional tasks. Future research directions include building larger-scale multi-task models and providing dataset support for more tasks.
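
The computational-efficiency motivation can be illustrated with a minimal cost sketch. It is not taken from the paper: the `Task` class, the function names, the task names, and all cost figures are hypothetical and use arbitrary units. It only shows why running one shared encoder for several task decoders costs less than running one full encoder-decoder network per task.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Task:
    """A perception task with a hypothetical per-frame decoder cost."""
    name: str
    decoder_cost: float  # arbitrary units, illustrative only


def single_task_cost(encoder_cost: float, tasks: Sequence[Task]) -> float:
    """Cost when every task runs its own encoder and its own decoder."""
    return sum(encoder_cost + task.decoder_cost for task in tasks)


def multi_task_cost(encoder_cost: float, tasks: Sequence[Task]) -> float:
    """Cost when one shared encoder pass feeds every task decoder."""
    return encoder_cost + sum(task.decoder_cost for task in tasks)


if __name__ == "__main__":
    # Assumed numbers: the encoder dominates the cost of each network.
    tasks = [Task("segmentation", 2.0), Task("depth", 1.0), Task("motion", 1.0)]
    encoder_cost = 10.0

    separate = single_task_cost(encoder_cost, tasks)
    shared = multi_task_cost(encoder_cost, tasks)
    print(f"separate networks: {separate}, shared encoder: {shared}")
```

With these assumed figures, three separate networks cost 34 units per frame while the shared-encoder design costs 14. The saving grows with the number of tasks and with the share of computation spent in the encoder; actual savings depend on the real architecture and hardware.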