Abstract: The cornerstone of autonomous vehicles (AV) is a solid perception system, where camera encoders play a crucial role. Existing works usually leverage pre-trained Convolutional Neural Networks (CNNs) or Vision Transformers (ViTs) designed for general vision tasks, such as image classification, segmentation, and 2D detection. Although those well-known architectures have achieved state-of-the-art accuracy in AV-related tasks, e.g., 3D Object Detection, there remains significant potential for improvement in network design due to the nuanced complexities of industrial-level AV datasets. Moreover, existing public AV benchmarks usually contain insufficient data, which might lead to inaccurate evaluation of those architectures. To reveal the AV-specific model insights, we start from a standard general-purpose encoder, ConvNeXt, and progressively transform the design. We adjust different design parameters including width and depth of the model, stage compute ratio, attention mechanisms, and input resolution, supported by systematic analysis of each modification. This customization yields an architecture optimized for AV camera encoders, achieving an 8.79% mAP improvement over the baseline. We believe our effort could become a sweet cookbook of image encoders for AV and pave the way to the next-level drive system.

What problem does this paper attempt to address?

This paper addresses the design of camera encoders for autonomous driving perception systems, particularly the challenges that arise with industrial-level autonomous vehicle (AV) datasets. Specifically, the paper aims to:

1. **Improve the performance of existing models in autonomous driving tasks**: Although existing pre-trained convolutional neural networks (CNNs) and vision transformers (ViTs) perform well in general-purpose vision tasks (such as image classification, segmentation, and 2D detection), there is still room for improvement in autonomous-driving-related tasks (such as 3D object detection). The paper seeks to optimize these models through customized design so that they adapt to the complexity and diversity of autonomous driving datasets.
2. **Address the characteristics of autonomous driving datasets**:
   - **Differences in class distribution**: Compared with general-purpose datasets, autonomous driving datasets usually involve fewer classes, with more samples for each class.
   - **Diversity of sensor types**: Autonomous driving datasets use multiple types of camera sensors with different fields of view and resolutions.
   - **Wider detection range**: Autonomous driving datasets require models to have stronger localization capabilities, especially for detecting distant and small objects.
   - **Diversity of scenes**: Autonomous driving datasets cover more diverse driving scenes, whereas public datasets are limited in scope, which makes it difficult to evaluate model performance accurately.
3. **Optimize the model architecture to improve performance**: By adjusting the design parameters of the model (such as width, depth, stage compute ratio, attention mechanism, and input resolution) and analyzing each change systematically, the paper proposes an optimized encoder architecture that achieves higher accuracy in autonomous driving tasks.
4. **Provide a design guide for image encoders for autonomous driving**: The paper shows how customized design can significantly improve model performance (8.79% mAP improvement over the baseline) and offers experience and references that future research can use to build more efficient autonomous driving perception systems.

### Formula Representation

The key formulas and concepts involved in the paper are:

- **mAP (mean Average Precision)**:
  \[ \text{mAP} = \frac{1}{N} \sum_{i = 1}^{N} \text{AP}_i \]
  where \( N \) is the number of classes and \( \text{AP}_i \) is the average precision of the \( i \)-th class.
- **Stage Compute Ratio**:
  \[ \text{Stage Compute Ratio}_s = \frac{\text{Compute allocated to stage } s}{\text{Total compute across all stages}} \]
- **Input Resolution**:
  \[ \text{Input Resolution} = W \times H \]
  where \( W \) and \( H \) are the width and height of the input image in pixels.

Through these modifications, the paper shows how customized design can significantly enhance the performance of the autonomous driving perception system, providing a reference for future work on camera encoders for autonomous driving.
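The mAP and stage compute ratio definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's evaluation code: the function names, the AP values, and the use of per-stage block counts as a stand-in for per-stage compute are all assumptions made for the example.

```python
from typing import List, Sequence


def mean_average_precision(ap_per_class: Sequence[float]) -> float:
    """Average the per-class AP values: mAP = (1/N) * sum(AP_i)."""
    if not ap_per_class:
        raise ValueError("need at least one class")
    return sum(ap_per_class) / len(ap_per_class)


def stage_compute_ratio(compute_per_stage: Sequence[float]) -> List[float]:
    """Return each stage's share of the total compute (shares sum to 1)."""
    total = sum(compute_per_stage)
    if total <= 0:
        raise ValueError("total compute must be positive")
    return [c / total for c in compute_per_stage]


# Hypothetical numbers, for illustration only (not taken from the paper).
example_ap = [0.5, 0.25, 0.75]   # assumed per-class AP values
example_blocks = [3, 3, 9, 3]    # assumed block counts, used as a proxy for compute

print(mean_average_precision(example_ap))
print(stage_compute_ratio(example_blocks))
```

Block counts are only a rough proxy: actual per-stage compute also depends on channel width and feature-map resolution, so a faithful ratio would use measured FLOPs per stage.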

Exploring Camera Encoder Designs for Autonomous Driving Perception
