Abstract: Driving can take up a substantial part of daily life and frequently triggers negative emotions such as anger or anxiety, which adversely affect both driving safety and long-term health. To identify driver emotions, and thereby improve the safety and humanization of intelligent driving, this work explores how to model discriminative emotion features from both speech and facial expressions. Specifically, we propose an attention-based network for facial expressions and a lightweight speech emotion network, and combine their audio and video features at the feature level to build a multimodal driver emotion recognition model. For audio, we propose a new feature extractor that uses a multi-scale residual structure to extract spectrogram features. For video, preprocessing with Local Binary Pattern Histograms (LBPH) yields a set of frame sequences with a fixed-dimensional feature representation. These features are fed into a fine-tuned ResNet18 model to analyze spatial information, and the model is further augmented with a temporal attention module and a Gated Recurrent Unit (GRU) to produce a highly discriminative video representation. We also propose an Internet of Vehicles (IoV) platform designed specifically for driver emotion recognition. It consists of a sensor layer, a data acquisition and transport layer, a server layer, and a data application layer, and uses sensors to collect multimodal driver data that support the proposed recognition algorithm. The proposed algorithm is evaluated on two multimodal emotion datasets, the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) and Surrey Audio-Visual Expressed Emotion (SAVEE), using a variety of performance indicators.
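To make the fusion step described above more concrete, the following is a minimal sketch of temporal attention pooling over per-frame video features followed by feature-level fusion with an audio feature vector. It is not the authors' implementation: the ResNet18 backbone, the GRU, and the multi-scale residual audio extractor are omitted, the dot-product scoring rule is an assumption, fusion is assumed to be plain concatenation, and all names (`temporal_attention_pool`, `fuse_feature_level`, `score_weights`) and the toy values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention_pool(frame_features, score_weights):
    """Collapse a sequence of per-frame vectors into a single vector.

    Assumption: each frame is scored by a dot product with a (normally
    learned) weight vector; the abstract does not specify the scoring rule.
    """
    scores = [
        sum(w * x for w, x in zip(score_weights, frame))
        for frame in frame_features
    ]
    alphas = softmax(scores)  # one attention weight per frame, summing to 1
    dim = len(frame_features[0])
    pooled = [
        sum(a * frame[d] for a, frame in zip(alphas, frame_features))
        for d in range(dim)
    ]
    return pooled, alphas


def fuse_feature_level(audio_feature, video_feature):
    """Feature-level fusion, assumed here to be simple concatenation."""
    return list(audio_feature) + list(video_feature)


# Toy inputs standing in for network outputs (hypothetical values).
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three frames, 2-D features
score_weights = [0.5, 0.5]                     # stand-in for learned weights
audio_vec = [0.2, 0.4, 0.6]                    # stand-in for a speech embedding

video_vec, alphas = temporal_attention_pool(frames, score_weights)
fused = fuse_feature_level(audio_vec, video_vec)
print(len(fused), [round(a, 3) for a in alphas])
```

In this sketch the fused vector would be passed to a classifier over the emotion classes; concatenation keeps the two modalities' features side by side so that the classifier can weigh them jointly.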
Compared with the baseline methods, the proposed multimodal model achieves state-of-the-art results, with recognition accuracy of 0.93 on RAVDESS and 0.99 on SAVEE. It also attains precision of 0.93 on RAVDESS and 0.99 on SAVEE, and specificity of 0.99 and 1.00, respectively.
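For readers less familiar with the indicators reported above, the sketch below shows one common way to compute accuracy, precision, and specificity from a multi-class confusion matrix. The abstract does not state how per-class scores are averaged, so macro-averaging is an assumption here, and the confusion matrix is a made-up toy example, not data from the paper.

```python
def per_class_counts(confusion, k):
    """Return (tp, fp, fn, tn) for class k in a one-vs-rest view.

    confusion[i][j] counts samples of true class i predicted as class j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    tp = confusion[k][k]
    fn = sum(confusion[k]) - tp                       # rest of row k
    fp = sum(confusion[r][k] for r in range(n)) - tp  # rest of column k
    tn = total - tp - fn - fp
    return tp, fp, fn, tn


def macro_metrics(confusion):
    """Accuracy plus macro-averaged precision and specificity (assumed averaging)."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    accuracy = sum(confusion[k][k] for k in range(n)) / total
    precisions, specificities = [], []
    for k in range(n):
        tp, fp, fn, tn = per_class_counts(confusion, k)
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        specificities.append(tn / (tn + fp) if tn + fp else 0.0)
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / n,
        "specificity": sum(specificities) / n,
    }


# Toy 3-class confusion matrix (hypothetical counts).
toy_confusion = [
    [8, 1, 1],
    [2, 7, 1],
    [0, 1, 9],
]
metrics = macro_metrics(toy_confusion)
print({name: round(value, 3) for name, value in metrics.items()})
```

Because every class treats all other classes as negatives, the true-negative count is large in a multi-class setting, which is why specificity typically sits above accuracy and precision, as it does in the reported results.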

Driver Multi-task Emotion Recognition Network Based on Multi-modal Facial Video Analysis

Drivers' Comprehensive Emotion Recognition Based on HAM

A Multimodal Driver Emotion Recognition Algorithm Based on the Audio and Video Signals in Internet of Vehicles Platform

Driver Emotion Recognition with a Hybrid Attentional Multimodal Fusion Framework

MFDR: Multiple-stage Fusion and Dynamically Refined Network for Multimodal Emotion Recognition

Emotion Recognition in Videos via Fusing Multimodal Features.

Driver Emotion Recognition Involving Multimodal Signals: Electrophysiological Response, Nasal-Tip Temperature, and Vehicle Behavior

Multimodal driver emotion recognition using motor activity and facial expressions

On-Road Driver Emotion Recognition Using Facial Expression

Driver Emotion and Fatigue State Detection Based on Time Series Fusion

A Convolution Bidirectional Long Short-Term Memory Neural Network for Driver Emotion Recognition

MultiFuser: Multimodal Fusion Transformer for Enhanced Driver Action Recognition

Multi-modal emotion analysis from facial expressions and electroencephalogram.

Video-based driver emotion recognition using hybrid deep spatio-temporal feature learning

A Unified Multi-scale and Multi-task Learning Framework for Driver Behaviors Reasoning

Multimodal Emotion Recognition based on Facial Expressions, Speech, and EEG

Research on driver's anger recognition method based on multimodal data fusion

DERNet: Driver Emotion Recognition Using Onboard Camera

Multi-modal emotion recognition using tensor decomposition fusion and self-supervised multi-tasking

CogEmoNet: A Cognitive-Feature-Augmented Driver Emotion Recognition Model for Smart Cockpit

Driver Emotion Recognition of Multiple-ECG Feature Fusion Based on BP Network and D-S Evidence