Abstract: With the development of AI-assisted driving, numerous methods have emerged for ego-vehicle 3D perception tasks, but there has been limited research on roadside perception. With its ability to provide a global view and a broader sensing range, the roadside perspective is worth developing. LiDAR provides precise three-dimensional spatial information, while cameras offer semantic information. These two modalities are complementary in 3D detection. However, in some studies adding camera data does not increase accuracy, because the information extraction and fusion procedure is not sufficiently reliable. Recently, Kolmogorov-Arnold Networks (KANs) have been proposed as replacements for MLPs and are better suited to high-dimensional, complex data. Both the camera and the LiDAR provide high-dimensional information, and employing KANs should enhance the extraction of valuable features to produce better fusion outcomes. This paper proposes Kaninfradet3D, which optimizes the feature extraction and fusion modules. To extract features from complex high-dimensional data, the model's encoder and fuser modules were improved using KAN Layers. Cross-attention was applied to enhance feature fusion, and visual comparisons verified that camera features were more evenly integrated. This addressed the issue of camera features being abnormally concentrated, which negatively impacted fusion. Compared to the benchmark, our approach shows improvements of +9.87 mAP and +10.64 mAP in the two viewpoints of the TUMTraf Intersection Dataset and an improvement of +1.40 mAP in the roadside end of the TUMTraf V2X Cooperative Perception Dataset. The results indicate that Kaninfradet3D can effectively fuse features, demonstrating the potential of applying KANs in roadside perception tasks.
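To make the idea of a KAN Layer concrete, the following is a minimal sketch of a KAN-style layer, not the paper's implementation. Where an MLP applies a fixed activation after a linear map, a KAN places a learnable one-dimensional function on every input-output edge and sums the results. The class name `RBFKANLayer`, its parameters, and the use of Gaussian bumps as the basis are assumptions made for illustration; the original KAN formulation parameterizes the edge functions with B-splines.

```python
import math
import random


class RBFKANLayer:
    """KAN-style layer: one learnable 1D function per input-output edge.

    Assumption: each edge function is a weighted sum of Gaussian bumps on a
    fixed grid. This stands in for the B-splines used in the KAN literature.
    """

    def __init__(self, in_dim, out_dim, num_basis=5, lo=-1.0, hi=1.0, seed=0):
        rng = random.Random(seed)
        step = (hi - lo) / (num_basis - 1)
        # Evenly spaced bump centers covering the expected input range.
        self.centers = [lo + k * step for k in range(num_basis)]
        self.width = step
        # coef[j][i][k]: weight of basis function k on the edge input i -> output j.
        self.coef = [
            [[rng.gauss(0.0, 0.1) for _ in range(num_basis)] for _ in range(in_dim)]
            for _ in range(out_dim)
        ]

    def basis(self, x):
        """Evaluate every Gaussian bump at the scalar x."""
        return [math.exp(-(((x - c) / self.width) ** 2)) for c in self.centers]

    def forward(self, x):
        """Output j is the sum over inputs i of the edge function phi_ji(x_i)."""
        feats = [self.basis(xi) for xi in x]
        return [
            sum(w * b for edge, f in zip(row, feats) for w, b in zip(edge, f))
            for row in self.coef
        ]
```

Calling `RBFKANLayer(3, 2).forward([0.1, -0.5, 0.9])` returns a list of two values. In a trained model the coefficients in `coef` would be learned by gradient descent, which is omitted here.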

What problem does this paper attempt to address?

The problem this paper attempts to solve: roadside perception currently performs worse than ego-vehicle perception in 3D object detection tasks, with particular deficiencies in feature extraction and fusion. Specifically, existing roadside perception models fail to fully utilize the complementary information of camera and LiDAR sensors, resulting in limited performance improvement after fusion.

### Specific manifestations of the problem:

1. **Importance of roadside perception**:
   - Roadside perception provides a global perspective, which can avoid the problem of target occlusion, reduce perception blind areas, and provide target detection at a longer distance.
   - Compared with ego-vehicle perception, which depends on advanced sensors and computing resources on the vehicle, roadside perception can utilize the sensing devices and computing power of public infrastructure, which helps to reduce the cost of the autonomous driving system and promote its adoption.
2. **Limitations of existing methods**:
   - Although LiDAR provides accurate three-dimensional spatial information and cameras provide rich semantic information, fusing the two does not always improve detection accuracy. For example, in some studies, adding camera data did not improve accuracy because the information extraction and fusion processes were not reliable enough.
   - Existing multi-modal fusion methods mainly focus on the ego-vehicle perspective, while studies from the roadside perspective are relatively few and effective fusion frameworks are lacking.
3. **Application potential of KANs**:
   - The recently proposed Kolmogorov-Arnold Networks (KANs) perform well on high-dimensional complex data and can replace traditional multi-layer perceptrons (MLPs) to enhance non-linear feature extraction.
   - KANs have shown advantages in visual data processing, but have not been fully applied to 3D object detection tasks.

### Solution:

To solve the above problems, this paper proposes the Kaninfradet3D model, which optimizes the feature extraction and fusion modules in the following ways:

1. **Improved feature encoder**:
   - Use KAN Layers to replace traditional linear layers, enhancing the feature extraction ability for high-dimensional complex data.
   - Introduce KAN Layers in the PointEncoder to improve the processing of LiDAR point cloud data.
2. **Cross-modal attention mechanism**:
   - Introduce the Camera-LiDAR CrossAttn module, which captures the internal relationship between camera and LiDAR features through a multi-head cross-attention mechanism so that the fused features are more balanced and meaningful.
   - This mechanism addresses the problem of camera features being abnormally concentrated and degrading the fusion result.
3. **Convolutional fusion module**:
   - Use KANConv to replace traditional convolutional layers, further enhancing the feature learning ability.
   - The ConKANfuser module further improves the fusion of features from different modalities in the fusion stage.

### Experimental results:

In experiments on the TUMTraf V2X Cooperative Perception Dataset and the TUMTraf Intersection Dataset, Kaninfradet3D improves on the baseline model across multiple evaluation metrics. In mAP (mean Average Precision), it gains +9.87 and +10.64 on the two viewpoints of the TUMTraf Intersection Dataset and +1.40 on the roadside end of the TUMTraf V2X Cooperative Perception Dataset. This indicates that KANs have potential in roadside perception tasks.

### Conclusion:

This paper addresses the problems of insufficient feature extraction and poor fusion in roadside perception by introducing KANs into the feature extraction and fusion modules, providing new ideas and methods for future roadside 3D object detection research.
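The cross-modal attention step described in the solution can be illustrated with a minimal single-head sketch. It is not the paper's Camera-LiDAR CrossAttn module: the function names, the choice of LiDAR tokens as queries and camera tokens as keys and values, and the use of a single head instead of multiple heads are all assumptions made for illustration.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(lidar_tokens, camera_tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product cross-attention.

    Assumption: LiDAR features act as queries and camera features as keys
    and values, so each LiDAR token gathers a weighted mix of camera tokens.
    """
    q = matmul(lidar_tokens, w_q)
    k = matmul(camera_tokens, w_k)
    v = matmul(camera_tokens, w_v)
    scale = math.sqrt(len(q[0]))
    k_transposed = [list(col) for col in zip(*k)]
    scores = matmul(q, k_transposed)
    # One attention distribution over camera tokens per LiDAR token.
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v), weights
```

Each row of the returned `weights` sums to 1 and shows how strongly one LiDAR token attends to each camera token. Inspecting these rows is one way to check whether camera information is spread across tokens or concentrated on a few of them.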

Kaninfradet3D: A Road-side Camera-LiDAR Fusion 3D Perception Model based on Nonlinear Feature Extraction and Intrinsic Correlation

Multi-Stage Residual Fusion Network for LIDAR-Camera Road Detection

From One to Many: Dynamic Cross Attention Networks for LiDAR and Camera Fusion

RI-Fusion: 3D Object Detection Using Enhanced Point Features With Range-Image Fusion for Autonomous Driving

Multi-View Adaptive Fusion Network for 3D Object Detection

Real time object detection using LiDAR and camera fusion for autonomous driving

ACF-Net: Asymmetric Cascade Fusion for 3D Detection with LiDAR Point Clouds and Images

KAN-RCBEVDepth: A multi-modal fusion algorithm in object detection for autonomous driving

Depth Completion via Inductive Fusion of Planar LIDAR and Monocular Camera

Real-time depth completion based on LiDAR-stereo for autonomous driving

Cascade fusion of multi-modal and multi-source feature fusion by the attention for three-dimensional object detection

BAFusion: Bidirectional Attention Fusion for 3D Object Detection Based on LiDAR and Camera

Building and optimization of 3D semantic map based on Lidar and camera fusion

InfraDet3D: Multi-Modal 3D Object Detection based on Roadside Infrastructure Camera and LiDAR Sensors

FS-Net: LiDAR-Camera Fusion With Matched Scale for 3D Object Detection in Autonomous Driving

TFIENet: Transformer Fusion Information Enhancement Network for Multi-Model 3D Object Detection

V2X-AHD: Vehicle-to-Everything Cooperation Perception via Asymmetric Heterogenous Distillation Network

Influence of Camera-LiDAR Configuration on 3D Object Detection for Autonomous Driving

LFP: Efficient and Accurate End-to-End Lane-Level Planning via Camera-LiDAR Fusion

PA3DNet: 3-D Vehicle Detection with Pseudo Shape Segmentation and Adaptive Camera-LiDAR Fusion

InterFusion: Interaction-based 4D Radar and LiDAR Fusion for 3D Object Detection