Abstract: Driver distraction causes a significant number of traffic accidents every year, resulting in economic losses and casualties. Commercial vehicles are still far from fully autonomous, so drivers continue to play an important role in operating and controlling the vehicle. Driver distraction behavior detection is therefore crucial for road safety. At present, driver distraction detection relies primarily on traditional convolutional neural networks (CNNs) and supervised learning methods. However, challenges remain, such as the high cost of labeled datasets, a limited ability to capture high-level semantic information, and weak generalization performance. To address these problems, this paper proposes a new self-supervised learning method based on masked image modeling for driver distraction behavior detection. Firstly, a self-supervised learning framework for masked image modeling (MIM) is introduced to reduce the heavy labor and material costs of dataset labeling. Secondly, the Swin Transformer is employed as an encoder. Performance is enhanced by reconfiguring the Swin Transformer block and adjusting the distribution of the number of window multi-head self-attention (W-MSA) and shifted window multi-head self-attention (SW-MSA) detection heads across all stages, which also makes the model more lightweight. Finally, various data augmentation strategies are used along with the best random masking strategy to strengthen the model's recognition and generalization ability. Test results on a large-scale driver distraction behavior dataset show that the proposed self-supervised learning method achieves an accuracy of 99.60%, approaching the performance of advanced supervised learning methods. Our code is publicly available at http://github.com/Rocky1salady-killer/SL-DDBD.
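As a rough illustration of the random masking step that masked image modeling relies on, the following minimal sketch hides a random subset of patches in an image grid. It is not the authors' implementation: the function names `random_patch_mask` and `apply_mask`, the 0.6 mask ratio, the patch size, and the constant fill value are all assumptions chosen for illustration (the abstract does not state the paper's settings, and MIM frameworks typically substitute a learnable mask token rather than a constant).

```python
import random


def random_patch_mask(grid_h, grid_w, mask_ratio, seed=None):
    """Return a grid_h x grid_w list of 0/1 flags; 1 marks a masked patch.

    mask_ratio is a hypothetical hyperparameter, not a value from the paper.
    """
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must be in [0, 1]")
    rng = random.Random(seed)
    num_patches = grid_h * grid_w
    num_masked = int(round(num_patches * mask_ratio))
    # Sample patch indices without replacement so the ratio is exact.
    masked = set(rng.sample(range(num_patches), num_masked))
    return [
        [1 if r * grid_w + c in masked else 0 for c in range(grid_w)]
        for r in range(grid_h)
    ]


def apply_mask(image, mask, patch_size, fill=0):
    """Return a copy of a 2D image (list of lists) with masked patches filled.

    A constant fill stands in for the mask token used by real MIM encoders.
    """
    out = [row[:] for row in image]
    for r, mask_row in enumerate(mask):
        for c, flag in enumerate(mask_row):
            if not flag:
                continue
            for y in range(r * patch_size, (r + 1) * patch_size):
                for x in range(c * patch_size, (c + 1) * patch_size):
                    out[y][x] = fill
    return out


if __name__ == "__main__":
    # A 12x12 toy "image" split into a 6x6 grid of 2x2 patches.
    image = [[1] * 12 for _ in range(12)]
    mask = random_patch_mask(6, 6, mask_ratio=0.6, seed=0)
    masked_image = apply_mask(image, mask, patch_size=2)
    for row in masked_image:
        print("".join(str(v) for v in row))
```

During pre-training, an encoder would see the masked image and be trained to reconstruct the hidden patches, which is what removes the need for class labels at that stage.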

ViT-DD: Multi-Task Vision Transformer for Semi-Supervised Driver Distraction Detection

DSDFormer: An Innovative Transformer-Mamba Framework for Robust High-Precision Driver Distraction Identification

Applying Spatiotemporal Attention to Identify Distracted and Drowsy Driving with Vision Transformers

Multimodal driver distraction detection using dual-channel network of CNN and Transformer

PoseViNet: Distracted Driver Action Recognition Framework Using Multi-View Pose Estimation and Vision Transformer

L-TLA: A Lightweight Driver Distraction Detection Method Based on Three-Level Attention Mechanisms

Driver Vigilance Detection from EEG Signals Using Transformer Networks

Multi-scale space-time transformer for driving behavior detection

Towards Infusing Auxiliary Knowledge for Distracted Driver Detection

Pose-guided multi-task video transformer for driver action recognition

A Novel Driver Distraction Behavior Detection Method Based on Self-supervised Learning with Masked Image Modeling

Driver Multi-task Emotion Recognition Network Based on Multi-modal Facial Video Analysis

Improving real-time driver distraction detection via constrained attention mechanism

M2DA: Multi-Modal Fusion Transformer Incorporating Driver Attention for Autonomous Driving

ViDT: An Efficient and Effective Fully Transformer-based Object Detector

Multi Self-supervised Pre-fine-tuned Transformer Fusion for Better Intelligent Transportation Detection

VTD: Visual and Tactile Database for Driver State and Behavior Perception

Towards Sustainable Safe Driving: A Multimodal Fusion Method for Risk Level Recognition in Distracted Driving Status

Efficient Vision Transformer for Accurate Traffic Sign Detection

MonoViT: Self-Supervised Monocular Depth Estimation with a Vision Transformer

Improving automatic detection of driver fatigue and distraction using machine learning