Abstract: In recent years, autonomous driving algorithms using low-cost vehicle-mounted cameras have attracted increasing attention from both academia and industry. These efforts span multiple fronts, including on-road object detection and 3-D reconstruction, but in this work we focus on a vision-based model that directly maps raw input images to steering angles using deep networks. This represents a nascent research topic in computer vision. The technical contributions of this work are three-fold. First, the model is learned and evaluated on real human driving videos that are time-synchronized with other vehicle sensors. This differs from many prior models trained on synthetic data from racing games. Second, state-of-the-art models, such as PilotNet, mostly predict the wheel angle independently for each video frame, which contradicts the common understanding of driving as a stateful process. Instead, our proposed model combines spatial and temporal cues, jointly considering instantaneous monocular camera observations and the vehicle's historical states. In practice, this is accomplished by inserting carefully designed recurrent units (e.g., LSTM and Conv-LSTM) at proper network layers. Third, to facilitate the interpretability of the learned model, we use a visual back-propagation scheme to discover and visualize the image regions that crucially influence the final steering prediction. Our experimental study is based on about 6 hours of human driving data provided by Udacity. Comprehensive quantitative evaluations demonstrate the effectiveness and robustness of our model, even under scenarios such as drastic lighting changes and abrupt turning. The comparison with other state-of-the-art models clearly reveals its superior performance in predicting the appropriate wheel angle for a self-driving car.
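
To illustrate the idea of stateful steering prediction, here is a minimal sketch in plain Python. It is not the architecture from the paper: the per-frame feature vectors stand in for the output of a convolutional network, the weights are random and untrained, and the names `TinyLSTMCell` and `predict_steering` are hypothetical. It shows only how one LSTM cell can fuse the current frame's features with the previous wheel angle while carrying hidden state across frames.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyLSTMCell:
    """A single LSTM cell with hypothetical, randomly initialised weights."""

    def __init__(self, input_size, hidden_size, seed=0):
        rng = random.Random(seed)
        n = input_size + hidden_size
        # One weight matrix per gate: input (i), forget (f), output (o), candidate (c).
        self.W = {
            g: [[rng.uniform(-0.1, 0.1) for _ in range(n)] for _ in range(hidden_size)]
            for g in "ifoc"
        }
        self.b = {g: [0.0] * hidden_size for g in "ifoc"}

    def _affine(self, gate, z):
        return [
            sum(w * v for w, v in zip(row, z)) + bias
            for row, bias in zip(self.W[gate], self.b[gate])
        ]

    def step(self, x, h, c):
        z = list(x) + list(h)  # concatenate the input with the previous hidden state
        i = [sigmoid(v) for v in self._affine("i", z)]
        f = [sigmoid(v) for v in self._affine("f", z)]
        o = [sigmoid(v) for v in self._affine("o", z)]
        g = [math.tanh(v) for v in self._affine("c", z)]
        c_new = [fi * ci + ii * gi for fi, ci, ii, gi in zip(f, c, i, g)]
        h_new = [oi * math.tanh(ci) for oi, ci in zip(o, c_new)]
        return h_new, c_new


def predict_steering(frame_features, prev_angles, hidden_size=4, seed=0):
    """Return one steering value per frame, carrying LSTM state across frames."""
    input_size = len(frame_features[0]) + 1  # visual features + previous wheel angle
    cell = TinyLSTMCell(input_size, hidden_size, seed)
    rng = random.Random(seed + 1)
    readout = [rng.uniform(-0.1, 0.1) for _ in range(hidden_size)]  # linear output layer
    h = [0.0] * hidden_size
    c = [0.0] * hidden_size
    outputs = []
    for feats, prev in zip(frame_features, prev_angles):
        h, c = cell.step(list(feats) + [prev], h, c)
        outputs.append(sum(w * v for w, v in zip(readout, h)))
    return outputs


# Three identical frames: a frame-independent model would give identical outputs,
# whereas the recurrent state makes each prediction depend on the history.
frames = [[0.5, -0.2, 0.1]] * 3
angles = predict_steering(frames, prev_angles=[0.0, 0.0, 0.0])
```

Because the hidden and cell states persist between calls to `step`, the three predictions differ even though the inputs are the same, which is the contrast with per-frame prediction that the abstract draws.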

Learning Visual Representation for Autonomous Drone Navigation Via a Contrastive World Model

Learning Navigational Visual Representations with Semantic Map Supervision

BEVNav: Robot Autonomous Navigation Via Spatial-Temporal Contrastive Learning in Bird's-Eye View

Learning Deep Sensorimotor Policies for Vision-based Autonomous Drone Racing

DMCL: Robot Autonomous Navigation Via Depth Image Masked Contrastive Learning

Context vector-based visual mapless navigation in indoor using hierarchical semantic information and meta-learning

Cohere3D: Exploiting Temporal Coherence for Unsupervised Representation Learning of Vision-based Autonomous Driving

Learning Semantic-Agnostic and Spatial-Aware Representation for Generalizable Visual-Audio Navigation

ST-P3: End-to-end Vision-based Autonomous Driving via Spatial-Temporal Feature Learning

Navigation Command Matching for Vision-based Autonomous Driving

Bird's Eye View Based Pretrained World model for Visual Navigation

Visual Representations for Semantic Target Driven Navigation

Self-supervised 3D Semantic Representation Learning for Vision-and-Language Navigation

Self-Supervised Representation Learning With Spatial-Temporal Consistency for Sign Language Recognition

Learning Latent Dynamic Robust Representations for World Models

Learning End-to-End Autonomous Steering Model from Spatial and Temporal Visual Cues

A Data-Efficient Framework for Training and Sim-to-Real Transfer of Navigation Policies