Pose-guided multi-task video transformer for driver action recognition

Ricardo Pizarro, Roberto Valle, Luis Miguel Bergasa, José M. Buenaposada, Luis Baumela
2024-07-19
Abstract: We investigate the task of identifying situations of distracted driving through analysis of in-car videos. To tackle this challenge we introduce a multi-task video transformer that predicts both distracted actions and driver pose. Leveraging VideoMAEv2, a large pre-trained architecture, our approach incorporates semantic information from human keypoint locations to enhance action recognition and decrease computational overhead by minimizing the number of spatio-temporal tokens. By guiding token selection with pose and class information, we notably reduce the model's computational requirements while preserving the baseline accuracy. Our model surpasses existing state-of-the-art results in driver action recognition while exhibiting superior efficiency compared to current video transformer-based approaches.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the detection of distracted driving from in-vehicle video. The authors introduce a multi-task video transformer that predicts distracted actions and driver pose simultaneously, with the aim of improving action recognition accuracy while reducing computational overhead.

### Problem Background

Driver distraction is a serious road safety issue. According to Eurostat, 19,917 people were killed in traffic accidents in the European Union in 2021. Although the exact number of accidents caused by distracted driving is unclear, an estimated 5-25% of European traffic accidents are attributed to driver distraction, which includes using a mobile phone or GPS, eating, smoking, fatigue, and stress. Research shows that 68.3% of car accidents show obvious signs of distraction. Integrating human-machine interface (HMI) technology into advanced driver assistance systems (ADAS) has therefore become crucial.

### Existing Methods and Their Shortcomings

Existing methods for driver distraction detection rely mainly on convolutional neural networks (CNNs), combinations of CNNs and long short-term memory networks (LSTMs), and various Transformer models. Although these methods have made progress, they still trade accuracy against efficiency. In particular, large video Transformer models are difficult to deploy in real driving scenarios because of their heavy computational requirements.

### Solutions Proposed in the Paper

To address these challenges, the authors propose a multi-task video Transformer based on VideoMAEv2, with the following innovations:

1. **Pose-guided multi-task learning**: Semantic information from human keypoint positions is used to enhance action recognition, and computational overhead is reduced by decreasing the number of spatio-temporal tokens.
2. **Pose information for token selection**: Token selection is guided by pose and class information, which significantly reduces GFLOPs (billions of floating-point operations) while maintaining the accuracy of the baseline model.
3. **Multi-person representation**: The human pose representation, based on keypoint heatmaps, can handle multi-person data, which makes the method suitable for multi-actor datasets.
4. **Higher efficiency and accuracy**: The method outperforms existing state-of-the-art results in driver action recognition and is more efficient than current video Transformer-based methods.

### Main Contributions

- A multi-task video Transformer that uses human keypoint heatmaps to select the most informative video tokens.
- Pose and class information to guide token selection, which significantly reduces computational requirements while maintaining high accuracy.
- A human pose representation based on keypoint heatmaps that supports multi-person scenarios.
- On the Drive&Act dataset, the method improves accuracy by 8% and reduces computation by 25%.

Through these innovations, the work provides a more practical and efficient solution for driver distraction monitoring.
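To make the idea of pose-guided token selection more concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each spatial patch is scored by the keypoint-heatmap mass falling inside it and that only the highest-scoring fraction of patches is kept. The function names `patch_scores` and `select_tokens`, the `keep_ratio` parameter, and the scoring rule are all hypothetical; the class-information guidance mentioned in the paper is omitted, and a real video model would apply this per spatio-temporal tubelet rather than to a single frame.

```python
from typing import List, Sequence

Heatmap = Sequence[Sequence[float]]  # one H x W map per keypoint (assumed layout)


def patch_scores(heatmaps: Sequence[Heatmap], patch: int) -> List[float]:
    """Sum the per-pixel maximum over keypoint heatmaps inside each patch."""
    height, width = len(heatmaps[0]), len(heatmaps[0][0])
    if height % patch or width % patch:
        raise ValueError("heatmap size must be a multiple of the patch size")
    cols = width // patch
    scores = [0.0] * ((height // patch) * cols)
    for y in range(height):
        for x in range(width):
            # Taking the max over keypoints merges all maps, so keypoints
            # from several people can contribute to the same score grid.
            value = max(hm[y][x] for hm in heatmaps)
            scores[(y // patch) * cols + x // patch] += value
    return scores


def select_tokens(heatmaps: Sequence[Heatmap], patch: int,
                  keep_ratio: float) -> List[int]:
    """Return row-major indices of the patches with the most pose mass."""
    scores = patch_scores(heatmaps, patch)
    n_keep = max(1, round(keep_ratio * len(scores)))
    # sorted() is stable, so ties are broken in favour of the lower index.
    ranked = sorted(range(len(scores)), key=lambda i: -scores[i])
    return sorted(ranked[:n_keep])


# Toy example: two keypoints on a 4x4 frame split into four 2x2 patches.
hm_a = [[0.0] * 4 for _ in range(4)]
hm_b = [[0.0] * 4 for _ in range(4)]
hm_a[0][0] = 1.0   # keypoint in the top-left patch (index 0)
hm_b[3][3] = 0.5   # keypoint in the bottom-right patch (index 3)
kept = select_tokens([hm_a, hm_b], patch=2, keep_ratio=0.5)
print(kept)  # [0, 3]
```

Only the tokens at the returned indices would be passed on to later transformer layers in this sketch. Because self-attention cost grows with the number of tokens, dropping patches that carry no pose evidence is one plausible way to obtain the kind of compute reduction the summary describes.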