Zhuyun Zhou, Zongwei Wu, Florian Bolli, Rémi Boutteau, Fan Yang, Radu Timofte, Dominique Ginhac, Tobi Delbruck
Abstract: Autonomous racing has rapidly gained research attention. Traditionally, racing cars rely on 2D LiDAR as their primary visual system. In this work, we explore the integration of an event camera with the existing system to provide enhanced temporal information. Our goal is to fuse the 2D LiDAR data with event data in an end-to-end learning framework for steering prediction, which is crucial for autonomous racing. To the best of our knowledge, this is the first study addressing this challenging research topic. We start by creating a multisensor dataset specifically for steering prediction. Using this dataset, we establish a benchmark by evaluating various SOTA fusion methods. Our observations reveal that existing methods often incur substantial computational costs. To address this, we apply low-rank techniques to propose a novel, efficient, and effective fusion design. We introduce a new fusion learning policy to guide the fusion process, enhancing robustness against misalignment. Our fusion architecture provides better steering prediction than LiDAR alone, significantly reducing the RMSE from 7.72 to 1.28. Compared to the second-best fusion method, our work represents only 11% of the learnable parameters while achieving better accuracy. The source code, dataset, and benchmark will be released to promote future research.
What problem does this paper attempt to address?
This paper addresses **steering angle prediction for autonomous racing cars**, specifically on the F1tenth prototype car. Traditional approaches rely on 2D LiDAR as the primary vision system, which has the following limitations:
1. **Insufficient spatial perception**: 2D LiDAR is only sensitive to depth changes and lacks spatial perception along the Y-axis and Z-axis.
2. **Lack of temporal cues**: 2D LiDAR cannot provide sufficient dynamic information, so perception can lag in high-speed dynamic environments, which impairs the vehicle's ability to make quick decisions.
To address these problems, the authors propose a new multi-sensor fusion method that combines 2D LiDAR with an event camera to improve the accuracy and real-time performance of steering prediction. The main contributions of the study are:
- **Creation of a multi-sensor dataset**: A multi-sensor dataset was created specifically for steering prediction and used to evaluate different fusion methods.
- **Application of low-rank techniques**: To reduce the computational cost of existing fusion methods, low-rank techniques were used to design a new, efficient, and effective fusion architecture.
- **New fusion learning strategy**: A new fusion learning strategy was proposed. By maximizing the joint entropy between the two sensor inputs, it improves the robustness of the fusion process, especially when the sensors are poorly aligned.
- **Significant error reduction**: Compared with using only 2D LiDAR, the new method reduces the RMSE from 7.72 to 1.28. It uses only 11% of the learnable parameters of the second-best fusion method while achieving better accuracy.
Through these improvements, this study not only improves the performance of steering prediction in autonomous racing cars but also provides valuable benchmarks and datasets for future research.
### Formula Summary
- **Event stream definition**:
\[
\varepsilon=\{e_i|e_i = ((x_i,y_i),t_i,p_i),t_i\in[t_{\text{start}},t_{\text{end}}]\}
\]
where \(e_i\) represents a single event, \((x_i,y_i)\) are its pixel coordinates, \(t_i\) is its timestamp, and \(p_i\in\{+1,-1\}\) is the polarity of the brightness change.
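To make the event-stream definition concrete, the following minimal sketch stores events as `((x, y), t, p)` tuples and sums polarities per pixel over a time window. Accumulating events into a polarity frame is one common representation and is used here only as an illustration; the function name `accumulate_events`, the sensor size, and the sample events are hypothetical and not taken from the paper.

```python
import numpy as np

def accumulate_events(events, height, width, t_start, t_end):
    """Sum event polarities per pixel for events with t in [t_start, t_end].

    `events` is an iterable of ((x, y), t, p) tuples, mirroring the
    definition e_i = ((x_i, y_i), t_i, p_i). This frame representation is
    an illustrative assumption, not necessarily the paper's encoding.
    """
    frame = np.zeros((height, width), dtype=np.int32)
    for (x, y), t, p in events:
        if t_start <= t <= t_end:
            frame[y, x] += p  # row index is y, column index is x
    return frame

# Hypothetical events on a tiny 3x2 (width x height) sensor.
events = [
    ((0, 0), 0.001, +1),
    ((1, 0), 0.002, -1),
    ((0, 0), 0.003, +1),
    ((2, 1), 0.050, +1),  # falls outside the window below
]
frame = accumulate_events(events, height=2, width=3, t_start=0.0, t_end=0.010)
```

Only the first three events fall inside the 10 ms window, so the last event leaves its pixel at zero.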
- **Projection model**:
\[
p_{\text{image}}=K[R|t]P_{\text{LiDAR}}
\]
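where \(P_{\text{LiDAR}}\) is a 3D point in the LiDAR coordinate system (written in homogeneous coordinates so that the \(3\times4\) matrix \([R|t]\) can be applied), \(R\in\mathbb{R}^{3\times3}\) is a rotation matrix, \(t\in\mathbb{R}^{3\times1}\) is a translation vector, and \(K\in\mathbb{R}^{3\times3}\) is the intrinsic matrix of the event camera. The result \(p_{\text{image}}\) is a homogeneous image point; dividing by its third component gives pixel coordinates.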
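The projection can be sketched with NumPy as follows. The helper name `project_lidar_to_image` and all calibration values are hypothetical placeholders; a real setup would use calibrated values for `K`, `R`, and `t`.

```python
import numpy as np

def project_lidar_to_image(points_lidar, K, R, t):
    """Project N x 3 LiDAR points to pixel coordinates with K [R | t].

    Returns (pixels, depth): pixels is N x 2, depth is the third homogeneous
    component. Points with non-positive depth lie behind the camera and
    should be filtered out by the caller.
    """
    points_lidar = np.asarray(points_lidar, dtype=float)
    ones = np.ones((points_lidar.shape[0], 1))
    P_h = np.hstack([points_lidar, ones])        # N x 4 homogeneous points
    Rt = np.hstack([R, np.reshape(t, (3, 1))])   # 3 x 4 extrinsic matrix
    p = (K @ Rt @ P_h.T).T                       # N x 3 homogeneous pixels
    depth = p[:, 2]
    pixels = p[:, :2] / depth[:, None]           # perspective division
    return pixels, depth

# Hypothetical calibration: focal length 100 px, principal point (64, 48),
# and LiDAR axes assumed already aligned with the camera (identity extrinsics).
K = np.array([[100.0, 0.0, 64.0],
              [0.0, 100.0, 48.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.zeros(3)
pixels, depth = project_lidar_to_image([[0.0, 0.0, 2.0], [1.0, 0.5, 2.0]], K, R, t)
```

With these placeholder values, a point on the optical axis lands on the principal point, which is a quick sanity check for any calibration.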
- **Similarity loss**:
\[
L_{\text{div}}=L_{\text{KL}}(f_S,f_D)+L_{\text{KL}}(f_S,f_E)
\]
\[
L_{\text{KL}}(A,B)=\text{KL}(A||B)+\text{KL}(B||A)
\]
- **Overall loss function**:
\[
L = \lambda\cdot L_{\text{div}}+L_2
\]
where \(\lambda\) is a hyperparameter, set to 0.25.
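The two loss definitions above can be sketched in NumPy as follows. The summary does not define \(f_S\), \(f_D\), \(f_E\), or \(L_2\), so this sketch makes explicit assumptions: the three features are plain vectors converted to probability distributions with a softmax, and \(L_2\) is the mean squared error between predicted and ground-truth steering values. All function names are hypothetical.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1D feature vector."""
    z = np.asarray(x, dtype=float)
    e = np.exp(z - np.max(z))
    return e / e.sum()

def kl(a, b, eps=1e-12):
    """KL(a || b) for two discrete distributions."""
    return float(np.sum(a * (np.log(a + eps) - np.log(b + eps))))

def symmetric_kl(a, b):
    """L_KL(A, B) = KL(A || B) + KL(B || A)."""
    return kl(a, b) + kl(b, a)

def total_loss(f_s, f_d, f_e, y_true, y_pred, lam=0.25):
    """L = lam * L_div + L_2, with lam = 0.25 as stated in the summary.

    Assumptions: features become distributions via softmax, and L_2 is the
    mean squared error on the steering target.
    """
    p_s, p_d, p_e = softmax(f_s), softmax(f_d), softmax(f_e)
    l_div = symmetric_kl(p_s, p_d) + symmetric_kl(p_s, p_e)
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    l2 = float(np.mean(diff ** 2))
    return lam * l_div + l2

f = np.array([1.0, 2.0, 3.0])
loss_identical = total_loss(f, f, f, y_true=[1.0], y_pred=[1.0])
loss_regression_only = total_loss(f, f, f, y_true=[0.0], y_pred=[2.0])
```

When all three features are identical, the divergence term vanishes and only the regression term remains, which makes the two contributions easy to check separately.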
- **Root-mean-square error (RMSE)**:
\[
\text{RMSE}=\sqrt{\frac{1}{N}\sum_{i = 1}^N(y_i-\hat{y}_i)^2}
\]
- **Mean absolute error (MAE)**:
\[
\text{MAE}=\frac{1}{N}\sum_{i = 1}^N\left|y_i-\hat{y}_i\right|
\]
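As a small worked example of the two evaluation metrics, the sketch below computes RMSE and MAE with NumPy, using the standard definitions. The sample values are made up for illustration and are not results from the paper.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-square error between targets and predictions."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error between targets and predictions."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(diff)))

# Hypothetical steering values: three exact predictions and one error of 4.
y_true = [0.0, 1.0, 2.0, 3.0]
y_pred = [0.0, 1.0, 2.0, 7.0]
rmse_value = rmse(y_true, y_pred)  # sqrt(16 / 4) = 2.0
mae_value = mae(y_true, y_pred)    # 4 / 4 = 1.0
```

The single large error dominates the RMSE more than the MAE, which is why the two metrics are often reported together.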