Optical Flow Matters: an Empirical Comparative Study on Fusing Monocular Extracted Modalities for Better Steering

Fouad Makiyeh, Mark Bastourous, Anass Bairouk, Wei Xiao, Mirjana Maras, Tsun-Hsuan Wang, Marc Blanchon, Ramin Hasani, Patrick Chareyre, Daniela Rus
2024-09-18
Abstract: Autonomous vehicle navigation is a key challenge in artificial intelligence, requiring robust and accurate decision-making processes. This research introduces a new end-to-end method that exploits multimodal information from a single monocular camera to improve the steering predictions for self-driving cars. Unlike conventional models that require several sensors, which can be costly and complex, or that rely exclusively on RGB images, which may not be robust enough under different conditions, our model significantly improves vehicle steering prediction performance from a single visual sensor. By focusing on the fusion of RGB imagery with depth completion information or optical flow data, we propose a comprehensive framework that integrates these modalities through both early and hybrid fusion techniques. We use three distinct neural network models to implement our approach: Convolutional Neural Network - Neural Circuit Policy (CNN-NCP), Variational Auto Encoder - Long Short-Term Memory (VAE-LSTM), and Variational Auto Encoder - Neural Circuit Policy (VAE-NCP). By incorporating optical flow into the decision-making process, our method significantly advances autonomous navigation. Empirical results from our comparative study using Boston driving data show that our model, which integrates image and motion information, is robust and reliable. It outperforms state-of-the-art approaches that do not use optical flow, reducing the steering estimation error by 31%. This demonstrates the potential of optical flow data, combined with advanced neural network architectures (a CNN-based structure for fusing data and a recurrence-based network for inferring a command from latent space), to enhance steering estimation for autonomous vehicles.
Computer Vision and Pattern Recognition, Artificial Intelligence
### What problem does this paper attempt to solve?

This paper addresses a key challenge in autonomous vehicle navigation: how to improve steering prediction using multi-modal information obtained from a single monocular camera. Specifically, the authors enhance single-sensor steering prediction by fusing RGB images with depth completion information or optical flow data.

#### Main problems:

1. **Cost and complexity of relying on multiple sensors**: Traditional autonomous driving models usually require multiple sensors (such as RGB cameras, radars, and lidars), which increase cost and introduce system complexity and synchronization and calibration problems.
2. **Limitations of relying solely on RGB images**: Models that rely solely on RGB images may not be robust enough under varying lighting conditions, road textures, and lane markings.
3. **Improving the accuracy of steering prediction**: Introducing optical flow data enhances the model's understanding of dynamic scenes, which improves the accuracy of steering prediction.

#### Solutions:

- **Multi-modal fusion**: Extract multiple modalities (RGB images, depth maps, and optical flow) from a single monocular camera and combine them through early fusion and hybrid fusion techniques.
- **Neural network architecture**: Use three different neural network models (CNN-NCP, VAE-LSTM, and VAE-NCP) to process the fused multi-modal input and achieve more accurate steering prediction.
- **Experimental verification**: Conduct an empirical study on the Boston driving data set, which shows that fusing optical flow information reduces the steering estimation error by 31% and improves the generalization ability and robustness of the model.
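The early fusion mentioned above can be illustrated with a minimal sketch. It assumes that early fusion means stacking the two modalities along the channel axis before the encoder sees them, which is a common reading but is not spelled out in this summary. The function name `early_fusion` and the toy image sizes are hypothetical, and plain nested lists stand in for image tensors.

```python
from typing import List

Pixel = List[float]          # channel values at one location
Image = List[List[Pixel]]    # height x width x channels


def early_fusion(m1: Image, m2: Image) -> Image:
    """Concatenate two modalities pixel by pixel along the channel axis.

    Assumption: both modalities are already aligned and share the same
    height and width, since they come from the same monocular camera.
    """
    same_height = len(m1) == len(m2)
    same_width = all(len(r1) == len(r2) for r1, r2 in zip(m1, m2))
    if not (same_height and same_width):
        raise ValueError("modalities must share the same height and width")
    # List addition appends the channels of m2 after the channels of m1.
    return [[p1 + p2 for p1, p2 in zip(r1, r2)] for r1, r2 in zip(m1, m2)]


# Toy 1 x 2 inputs: a 3-channel RGB image and a 2-channel optical flow field.
rgb = [[[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]]
flow = [[[1.0, -1.0], [0.5, 0.0]]]

x_ef = early_fusion(rgb, flow)
print(len(x_ef[0][0]))  # number of channels after fusion
```

In a real pipeline the same operation would typically be a single tensor concatenation along the channel dimension; the point here is only that the fused input carries the channels of both modalities at every pixel.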
### Formula summary

- **Early-fusion input representation**:
  \[ x_{EF}=[M_1, M_2] \]
  where \(M_1\) represents an RGB image and \(M_2\) represents an additional modality (depth or optical flow).
- **Hybrid-fusion output representation**:
  \[ z_{HF}=\text{Layer }4+\text{ACM}(G_5(\tilde{M}_1))+\text{ACM}(G_5(\tilde{M}_2)) \]
  where
  \[ \tilde{M}_1 = G_4(G_3(G_2(G_1(M_1)))) \]
  and \(\tilde{M}_2\) is obtained from \(M_2\) in the same way. \(G_1,\cdots, G_5\) represent the layers of the encoder, and ACM is the attention complementary module.
- **End-to-end steering estimation loss function**:
  \[ L(x,\hat{y})=\beta L_{VAE}+L_{\text{prediction}} \]
  where
  \[ L_{VAE}=\lambda_1 L_{\text{recon}}(x,\tilde{x})+\lambda_2 L_{KL}(\mu,\sigma) \]
  \[ L_{\text{prediction}}=\frac{\sum_i w(y_i)(\hat{y}_i - y_i)^2}{\sum_i w(y_i)}; \quad \hat{y} = \text{RNN}(z) \]
  Here, \(x\) represents the input, \(y\) is the ground-truth steering value, and \(\hat{y}\) is the prediction. \(\beta\) distinguishes between using a CNN (\(\beta = 0\)) and a VAE (\(\beta = 1\)) for feature extraction, \(\lambda_1 = 0.15\) and \(\lambda_2=\lambda_1 e^{-2}\) are regularization parameters, \(\tilde{x}\) represents the reconstructed \(x\), \(\mu\) and \(\sigma\) are the parameters used for sampling latent variables in the VAE, and the weighting function is \(w(y)=\exp(|y|^\alpha)\)
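
As a rough illustration of the loss above, the following sketch evaluates the weighted prediction term and the combined objective on scalar values. It is not the authors' implementation: the function names are hypothetical, the value of \(\alpha\) is not given in this summary (the example uses `alpha=1.0` purely as a placeholder), and \(\lambda_2\) is taken literally as \(\lambda_1 e^{-2}\) as written above. The reconstruction and KL terms are passed in as precomputed numbers.

```python
import math
from typing import Sequence


def sample_weight(y: float, alpha: float) -> float:
    """w(y) = exp(|y|^alpha): larger steering magnitudes get more weight."""
    return math.exp(abs(y) ** alpha)


def weighted_prediction_loss(
    y_true: Sequence[float], y_pred: Sequence[float], alpha: float
) -> float:
    """Weighted mean squared error, with weights computed from the targets."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    weights = [sample_weight(y, alpha) for y in y_true]
    numerator = sum(w * (p - t) ** 2 for w, t, p in zip(weights, y_true, y_pred))
    return numerator / sum(weights)


def total_loss(
    l_prediction: float,
    l_recon: float,
    l_kl: float,
    beta: float,
    lambda_1: float = 0.15,
    lambda_2: float = 0.15 * math.exp(-2),  # lambda_1 * e^{-2}, read literally
) -> float:
    """L = beta * L_VAE + L_prediction; beta = 0 for CNN, beta = 1 for VAE."""
    l_vae = lambda_1 * l_recon + lambda_2 * l_kl
    return beta * l_vae + l_prediction


# Hypothetical steering targets and predictions; alpha = 1.0 is a placeholder.
y_true = [0.0, 1.0]
y_pred = [0.5, 1.0]
l_pred = weighted_prediction_loss(y_true, y_pred, alpha=1.0)

cnn_loss = total_loss(l_pred, l_recon=2.0, l_kl=3.0, beta=0.0)
vae_loss = total_loss(l_pred, l_recon=2.0, l_kl=3.0, beta=1.0)
print(l_pred, cnn_loss, vae_loss)
```

With \(\beta = 0\) the VAE terms drop out and only the prediction error is optimized, which matches the CNN-based variant; with \(\beta = 1\) the reconstruction and KL terms are added on top.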