Abstract: Associating driver attention with the driving scene across two fields of view (FOVs) is a hard cross-domain perception problem that requires jointly considering cross-view mapping, dynamic driving scene analysis, and driver status tracking. Previous methods typically focus on a single view or map attention to the scene via estimated gaze, failing to exploit the implicit connection between them. Moreover, simple fusion modules are insufficient for modeling the complex relationships between the two views, making information integration challenging. To address these issues, we propose a novel method for end-to-end scene-associated driver attention estimation, called EraW-Net. This method enhances the most discriminative dynamic cues, refines feature representations, and facilitates semantically aligned cross-domain integration through a W-shaped architecture, termed W-Net. Specifically, a Dynamic Adaptive Filter Module (DAF-Module) is proposed to address the challenges of frequently changing driving environments by extracting vital regions. It suppresses indiscriminately recorded dynamics and highlights crucial ones through joint frequency-spatial analysis, enhancing the model's ability to parse complex dynamics. Additionally, to track driver states under non-fixed facial poses, we propose a Global Context Sharing Module (GCS-Module) that constructs refined feature representations by capturing hierarchical features that adapt to various scales of head and eye movements. Finally, W-Net achieves systematic cross-view information integration through its "Encoding-Independent Partial Decoding-Fusion Decoding" structure, addressing semantic misalignment in heterogeneous data integration. Experiments on large public datasets demonstrate that the proposed method robustly and accurately estimates the mapping of driver attention onto the scene.

What problem does this paper attempt to address?

The main problem this paper addresses is associating the driver's attention with the driving scene across two fields of view (FOVs). Specifically, it focuses on how to capture the driver's attention in a complex, dynamic driving environment and map it onto the current driving scene.

### Main Challenges

1. **Cross-Domain Information Integration**: The driver's attention and the road conditions are observed from two different perspectives, in-vehicle and on-road. There is no explicit overlapping information between these two perspectives, which makes cross-domain information integration extremely challenging.
2. **Complex Causal Relationships**: The driver's attention is affected by traffic conditions, but because drivers act on their own initiative, their attention may shift frequently without any significant change in the environment. This makes it difficult to clearly define the relationship between changes in the driver's attention and changes in the driving environment.
3. **Dynamic Behaviors and Environmental Changes**: During driving, the driver's behaviors (such as eye movements and head rotations) and the road conditions are constantly changing, which makes it harder to track the driver's attention accurately. In addition, sudden events in the driving environment (such as pedestrians or lane changes) require the model to adapt quickly.

### Solutions

To address these problems, the paper proposes EraW-Net (Enhance-Refine-Align W-Net), a new end-to-end method for scene-associated driver attention estimation. EraW-Net tackles them in the following ways:

1. **Enhancing Dynamic Cues**: The Dynamic Adaptive Filter Module (DAF-Module) extracts key dynamic regions through joint frequency-spatial analysis, suppressing unnecessary dynamic information and highlighting important dynamic features.
2. **Refining Feature Representations**: The Global Context Sharing Module (GCS-Module) refines facial feature representations by capturing multi-scale features that adapt to head and eye movements at different scales.
3. **Semantically Aligned Cross-Domain Information Integration**: The W-Net architecture systematically integrates complementary information from the two inputs through an "Encoding-Independent Partial Decoding-Fusion Decoding" structure, ensuring semantic alignment and improving the stability and performance of the model.

### Summary

Through its module design and network architecture, EraW-Net addresses cross-domain information integration, dynamic behavior tracking, and complex causal relationship modeling when associating driver attention with driving scenes, achieving more accurate driver attention mapping.

### Formula Display

The following are some formulas involved in the paper, presented in Markdown format:

1. **Channel Reduction Unit (CRU)**:
   \[ \text{CRU}(a_i)=\text{Conv}_{3\times3}(\text{Cat}[\text{Conv}_{1\times1}(a_i),\text{Conv}_{3\times3}(\text{Conv}_{1\times1}(a_i))]) \]
2. **Local Correlation Calculation**:
   \[ \text{Corr}_{L_n}=\frac{L_{n1}\cdot L_{n2}^T}{\sqrt{C}} \]
   \[ \tilde{P}_{L_n}=\text{softmax}(\text{Corr}_{L_n})\times P_{L_n} \]
3. **Frequency-Domain Filtering**:
   \[ D_{fe}=\text{IFFT}(\text{Conv}_{1\times1}(\text{FFT}(D))) \]

These formulas describe building blocks of EraW-Net.
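The local correlation and frequency-domain filtering formulas can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes that `L_{n1}`, `L_{n2}`, and `P_{L_n}` are token matrices of shape `(N, C)`, that the `×` in the second formula is a matrix product, and that the 1×1 convolution is a real-valued channel-mixing matrix applied to the complex spectrum. The function names `local_correlation` and `frequency_filter` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def local_correlation(l1, l2, p):
    """Compute softmax(l1 @ l2.T / sqrt(C)) @ p.

    Assumed shapes (not given in the source):
    l1: (N1, C), l2: (N2, C), p: (N2, Cp) -> result: (N1, Cp).
    """
    c = l1.shape[-1]
    corr = l1 @ l2.T / np.sqrt(c)       # scaled dot-product correlation
    return softmax(corr, axis=-1) @ p   # correlation-weighted sum of p


def frequency_filter(d, weight):
    """Compute IFFT(Conv1x1(FFT(d))) for a feature map d of shape (C, H, W).

    `weight` has shape (C_out, C) and plays the role of a bias-free
    1x1 convolution (an assumption made for this sketch).
    """
    spec = np.fft.fft2(d, axes=(-2, -1))                # per-channel 2D FFT
    mixed = np.einsum("oc,chw->ohw", weight, spec)      # 1x1 conv = channel mixing
    return np.fft.ifft2(mixed, axes=(-2, -1)).real      # back to the spatial domain


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    out = local_correlation(rng.normal(size=(4, 8)),
                            rng.normal(size=(5, 8)),
                            rng.normal(size=(5, 3)))
    print(out.shape)
    filtered = frequency_filter(rng.normal(size=(2, 6, 6)), np.eye(2))
    print(filtered.shape)
```

Because the FFT is linear, a purely linear, real-valued 1×1 convolution in the frequency domain gives the same result as applying the same channel mixing in the spatial domain. A practical filter would therefore presumably add something beyond this sketch, such as a nonlinearity or separate weights for the real and imaginary parts. The source does not specify which.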

EraW-Net: Enhance-Refine-Align W-Net for Scene-Associated Driver Attention Estimation

A Fusion Method Aiming at Environmental Perception of Autonomous Vehicle Based on Visual Scheme

PerimetryNet: A Multiscale Fine Grained Deep Network for Three-Dimensional Eye Gaze Estimation Using Visual Field Analysis

Unifying Terrain Awareness Through Real-Time Semantic Segmentation

All-day perception for intelligent vehicles: switching perception algorithms based on WBCNet

SADNet: Sustained Attention Decoding in a Driving Task by Self-Attention Convolutional Neural Network

Multisource Adaption for Driver Attention Prediction in Arbitrary Driving Scenes

E-DNet: An End-to-End Dual-Branch Network for Driver Steering Intention Detection

All in One Network for Driver Attention Monitoring

Driver Drowsiness Detection Using EEG and EOG with an Attention-CNN Framework

Perceive, Attend, and Drive: Learning Spatial Attention for Safe Self-Driving

STDA: Spatio-Temporal Dual-Encoder Network Incorporating Driver Attention to Predict Driver Behaviors Under Safety-Critical Scenarios

Improving real-time driver distraction detection via constrained attention mechanism

MMFN: Multi-Modal-Fusion-Net for End-to-End Driving

Fusion of Gaze and Scene Information for Driving Behaviour Recognition: A Graph-Neural-Network-Based Framework

Driver inattention monitoring system based on multimodal fusion with visual cues to improve driving safety

FBLNet: FeedBack Loop Network for Driver Attention Prediction

RGB and LiDAR Fusion-based 3D Semantic Segmentation for Autonomous Driving

Object-Level Attention Prediction for Drivers in the Information-Rich Traffic Environment

Platelet release products modulate some aspects of polymorphonuclear leukocyte activation

NDNet: Spacewise Multiscale Representation Learning via Neighbor Decoupling for Real-Time Driving Scene Parsing