Abstract: We investigate the task of identifying situations of distracted driving through analysis of in-car videos. To tackle this challenge, we introduce a multi-task video transformer that predicts both distracted actions and driver pose. Leveraging VideoMAEv2, a large pre-trained architecture, our approach incorporates semantic information from human keypoint locations to enhance action recognition and decrease computational overhead by minimizing the number of spatio-temporal tokens. By guiding token selection with pose and class information, we notably reduce the model's computational requirements while preserving the baseline accuracy. Our model surpasses existing state-of-the-art results in driver action recognition while exhibiting superior efficiency compared to current video transformer-based approaches.

What problem does this paper attempt to address?

The paper addresses the problem of identifying distracted driving from in-vehicle video. Specifically, the authors introduce a multi-task video transformer that simultaneously predicts distracted behaviors and driver pose. The method aims to improve action recognition accuracy while reducing computational overhead.

### Problem Background

Driver distraction is a serious road safety issue. According to Eurostat statistics, 19,917 people were killed in traffic accidents in the European Union in 2021. Although the exact number of accidents caused by distracted driving is unclear, an estimated 5-25% of European traffic accidents are caused by driver distraction, which includes behaviors such as using a mobile phone or GPS, eating, and smoking, as well as fatigue and stress. Research shows that 68.3% of car accidents show obvious signs of distraction. Integrating human-machine interface (HMI) technology into advanced driver assistance systems (ADAS) has therefore become crucial.

### Existing Methods and Their Shortcomings

Existing methods for driver distraction detection mainly rely on convolutional neural networks (CNNs), combinations of CNNs and long short-term memory networks (LSTMs), and various Transformer models. Although these methods have made progress, they still trade accuracy against efficiency. In particular, large-scale video Transformer models are difficult to deploy in real driving scenarios because of their heavy computational requirements.

### Solutions Proposed in the Paper

To address these challenges, the authors propose a multi-task video Transformer based on VideoMAEv2, with the following innovations:

1. **Pose-guided multi-task learning**: Use the semantic information of human keypoint positions to enhance action recognition and reduce computational overhead by reducing the number of spatio-temporal tokens.
2. **Pose information for token selection**: Guide token selection with pose and class information, significantly reducing GFLOPs (giga floating-point operations, a measure of compute cost) while maintaining the accuracy of the baseline model.
3. **Multi-person representation**: The human pose representation based on keypoint heatmaps can handle multi-person data, making the method suitable for multi-actor datasets.
4. **Higher efficiency and accuracy**: The method outperforms existing state-of-the-art results in driver action recognition and is more efficient than current video-Transformer-based methods.

### Main Contributions

- Propose a multi-task video Transformer that uses human keypoint heatmaps to select the most informative video tokens.
- Introduce pose and class information to guide token selection, significantly reducing computational requirements while maintaining high accuracy.
- Support multi-person scenarios through a human pose representation based on keypoint heatmaps.
- On the Drive&Act dataset, improve accuracy by 8% while reducing computation by 25%.

Through these innovations, the research provides a more practical and efficient solution for driver distraction monitoring.
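The idea of using keypoint heatmaps to decide which spatio-temporal tokens to keep can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the 16x16 patch size, the tubelet length of 2, the max-over-keypoints saliency, and the top-k selection rule are all assumptions made for illustration, and the class information that the paper also uses for selection is left out.

```python
import numpy as np

def patch_scores_from_heatmaps(heatmaps, patch=16, tubelet=2):
    """Score each spatio-temporal token by the keypoint heatmap mass it covers.

    heatmaps: array of shape (T, K, H, W), one heatmap per frame and keypoint.
    Returns an array of shape (T // tubelet, H // patch, W // patch).
    Patch and tubelet sizes are illustrative assumptions.
    """
    T, K, H, W = heatmaps.shape
    # A pixel counts as relevant if any keypoint (of any person) is close to it.
    saliency = heatmaps.max(axis=1)  # (T, H, W)
    # Split time and space into tubelet x patch x patch cubes, one per token.
    cubes = saliency.reshape(
        T // tubelet, tubelet, H // patch, patch, W // patch, patch
    )
    return cubes.sum(axis=(1, 3, 5))

def select_tokens(tokens, scores, keep_ratio=0.5):
    """Keep the highest-scoring fraction of tokens (hypothetical top-k rule).

    tokens: (N, D) token embeddings; scores: (N,) one score per token.
    Returns the kept tokens and their original indices, in original order.
    """
    n_keep = max(1, int(round(keep_ratio * len(scores))))
    keep = np.argsort(-scores, kind="stable")[:n_keep]
    keep.sort()  # preserve the original token order for positional embeddings
    return tokens[keep], keep

# Toy example: 2 frames, 1 keypoint, 32x32 pixels -> 1 x 2 x 2 = 4 tokens.
heatmaps = np.zeros((2, 1, 32, 32))
heatmaps[0, 0, 5, 5] = 1.0   # falls in the top-left patch (token 0)
heatmaps[1, 0, 20, 3] = 0.5  # falls in the bottom-left patch (token 2)

scores = patch_scores_from_heatmaps(heatmaps).ravel()
tokens = np.arange(4 * 3, dtype=float).reshape(4, 3)
kept_tokens, kept_idx = select_tokens(tokens, scores, keep_ratio=0.5)
```

In this toy setup only the two tokens that overlap a keypoint survive, so the transformer would process half as many tokens. Because self-attention cost grows with the number of tokens, dropping low-scoring tokens is what reduces compute in schemes of this kind.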

Pose-guided multi-task video transformer for driver action recognition

Transformer-based Fusion of 2D-pose and Spatio-temporal Embeddings for Distracted Driver Action Recognition

Multi-attribute Adaptive Aggregation Transformer for Vehicle Re-Identification.

PoseViNet: Distracted Driver Action Recognition Framework Using Multi-View Pose Estimation and Vision Transformer

MultiFuser: Multimodal Fusion Transformer for Enhanced Driver Action Recognition

Video Multitask Transformer Network

Do You Act Like You Talk? Exploring Pose-based Driver Action Classification with Speech Recognition Networks

Multi-scale space-time transformer for driving behavior detection

A Multi-Modal Transformer Network for Action Detection

Efficient Video Transformers via Spatial-Temporal Token Merging for Action Recognition

TransDARC: Transformer-based Driver Activity Recognition with Latent Space Feature Calibration

ViT-DD: Multi-Task Vision Transformer for Semi-Supervised Driver Distraction Detection

Task-specific alignment and multiple-level transformer for few-shot action recognition

Task-Specific Alignment and Multiple Level Transformer for Few-Shot Action Recognition

Multimodal driver distraction detection using dual-channel network of CNN and Transformer

MgMViT: Multi-Granularity and Multi-Scale Vision Transformer for Efficient Action Recognition

DSDFormer: An Innovative Transformer-Mamba Framework for Robust High-Precision Driver Distraction Identification

Driver Multi-task Emotion Recognition Network Based on Multi-modal Facial Video Analysis

Applying Spatiotemporal Attention to Identify Distracted and Drowsy Driving with Vision Transformers

ADAPT: Action-aware Driving Caption Transformer.

TAM-VT: Transformation-Aware Multi-scale Video Transformer for Segmentation and Tracking