EraW-Net: Enhance-Refine-Align W-Net for Scene-Associated Driver Attention Estimation

Jun Zhou, Chunsheng Liu, Faliang Chang, Wenqian Wang, Penghui Hao, Yiming Huang, Zhiqiang Yang
2024-11-01
Abstract: Associating driver attention with the driving scene across two fields of view (FOVs) is a hard cross-domain perception problem that requires comprehensive consideration of cross-view mapping, dynamic driving scene analysis, and driver status tracking. Previous methods typically focus on a single view or map attention to the scene via estimated gaze, failing to exploit the implicit connection between them. Moreover, simple fusion modules are insufficient for modeling the complex relationships between the two views, making information integration challenging. To address these issues, we propose a novel method for end-to-end scene-associated driver attention estimation, called EraW-Net. This method enhances the most discriminative dynamic cues, refines feature representations, and facilitates semantically aligned cross-domain integration through a W-shaped architecture, termed W-Net. Specifically, a Dynamic Adaptive Filter Module (DAF-Module) is proposed to address the challenges of frequently changing driving environments by extracting vital regions. It suppresses indiscriminately recorded dynamics and highlights crucial ones through joint frequency-spatial analysis, enhancing the model's ability to parse complex dynamics. Additionally, to track driver states during non-fixed facial poses, we propose a Global Context Sharing Module (GCS-Module) to construct refined feature representations by capturing hierarchical features that adapt to various scales of head and eye movements. Finally, W-Net achieves systematic cross-view information integration through its "Encoding-Independent Partial Decoding-Fusion Decoding" structure, addressing semantic misalignment in heterogeneous data integration. Experiments demonstrate that the proposed method robustly and accurately estimates the mapping of driver attention in the scene on large public datasets.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The main problem this paper addresses is how to associate the driver's attention with the driving scene across two fields of view (FOVs). Specifically, it focuses on capturing the driver's attention in a complex, dynamic driving environment and mapping it onto the current driving scene.

### Main Challenges

1. **Cross-Domain Information Integration**:
   - The driver's attention and the road conditions are observed from two different perspectives: in-vehicle and on-road. The two perspectives share no explicitly overlapping information, which makes cross-domain information integration extremely challenging.
2. **Complex Causal Relationships**:
   - The driver's attention is affected by traffic conditions, but because drivers act on their own initiative, their attention may shift frequently even when the environment does not change significantly. This makes it difficult to define a clear relationship between changes in the driver's attention and changes in the driving environment.
3. **Dynamic Behaviors and Environmental Changes**:
   - During driving, the driver's behaviors (such as eye movements and head rotations) and the road conditions change constantly, which makes it harder to track the driver's attention accurately. In addition, sudden events in the driving environment (such as pedestrians or lane changes) require the model to adapt quickly.

### Solutions

To address these problems, the paper proposes EraW-Net (Enhance-Refine-Align W-Net), a new end-to-end method for scene-associated driver attention estimation. EraW-Net tackles them in the following ways:

1. **Enhancing Dynamic Cues**:
   - The Dynamic Adaptive Filter Module (DAF-Module) is proposed. It extracts key dynamic regions through joint frequency-spatial analysis, suppressing unnecessary dynamic information and highlighting important dynamic features.
2. **Refining Feature Representations**:
   - The Global Context Sharing Module (GCS-Module) is proposed. It refines facial feature representations by capturing multi-scale features that adapt to head and eye movements at different scales.
3. **Semantically Aligned Cross-Domain Information Integration**:
   - The W-Net architecture is adopted. It systematically integrates complementary information from the two inputs through an "Encoding-Independent Partial Decoding-Fusion Decoding" structure, ensuring semantic alignment and improving the stability and performance of the model.

### Summary

Through its module design and network architecture, EraW-Net addresses cross-domain information integration, dynamic behavior tracking, and complex causal relationship modeling when associating driver attention with driving scenes, and achieves more accurate driver attention mapping.

### Formula Display

The following are some of the formulas used in the paper:

1. **Channel Reduction Unit (CRU)**:
   \[ \text{CRU}(a_i)=\text{Conv}_{3\times3}(\text{Cat}[\text{Conv}_{1\times1}(a_i),\text{Conv}_{3\times3}(\text{Conv}_{1\times1}(a_i))]) \]
2. **Local Correlation Calculation**:
   \[ \text{Corr}_{L_n}=\frac{L_{n1}\cdot L_{n2}^T}{\sqrt{C}} \]
   \[ \tilde{P}_{L_n}=\text{softmax}(\text{Corr}_{L_n})\times P_{L_n} \]
3. **Frequency-Domain Filtering**:
   \[ D_{fe}=\text{IFFT}(\text{Conv}_{1\times1}(\text{FFT}(D))) \]

These formulas illustrate core operations in EraW-Net.
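
The local correlation and frequency-domain filtering formulas above can be sketched in NumPy as follows. This is only an illustrative sketch, not the authors' implementation. Several details are assumptions that the summary does not specify: `L_n1`, `L_n2`, and `P_Ln` are treated as 2-D token matrices, the softmax is taken over the last axis, and the 1×1 convolution is applied to the complex spectrum by stacking its real and imaginary parts as channels (a common pattern in spectral layers). The function names `conv1x1`, `frequency_filter`, and `local_correlation` are hypothetical.

```python
import numpy as np


def conv1x1(x, weight):
    """Pointwise (1x1) convolution: mixes channels at every spatial position.

    x: (C_in, H, W) array, weight: (C_out, C_in) array.
    """
    return np.einsum("oc,chw->ohw", weight, x)


def frequency_filter(d, weight):
    """Sketch of D_fe = IFFT(Conv1x1(FFT(D))) for a real map d of shape (C, H, W).

    Assumption: real and imaginary parts are stacked into 2C channels so that
    a real-valued weight of shape (2C, 2C) can be applied to the spectrum.
    """
    c, h, w = d.shape
    spec = np.fft.rfft2(d, axes=(-2, -1))                      # complex, (C, H, W//2 + 1)
    stacked = np.concatenate([spec.real, spec.imag], axis=0)   # real, (2C, H, W//2 + 1)
    mixed = conv1x1(stacked, weight)                           # channel mixing per frequency bin
    filtered = mixed[:c] + 1j * mixed[c:]                      # back to a complex spectrum
    return np.fft.irfft2(filtered, s=(h, w), axes=(-2, -1))   # real, (C, H, W)


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def local_correlation(l1, l2, p):
    """Sketch of Corr = L1 @ L2^T / sqrt(C), then softmax(Corr) @ P.

    Assumed shapes: l1 (N, C), l2 (M, C), p (M, C_p); result is (N, C_p).
    """
    c = l1.shape[-1]
    corr = l1 @ l2.T / np.sqrt(c)
    return softmax(corr, axis=-1) @ p


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = rng.standard_normal((3, 8, 6))
    w = rng.standard_normal((6, 6)) * 0.1   # hypothetical stand-in for a learned weight
    print(frequency_filter(d, w).shape)

    l1 = rng.standard_normal((4, 5))
    l2 = rng.standard_normal((7, 5))
    p = rng.standard_normal((7, 3))
    print(local_correlation(l1, l2, p).shape)
```

With an identity weight, `frequency_filter` returns its input unchanged, which is a convenient sanity check that the forward and inverse transforms are paired correctly. Dividing by the square root of `C` keeps the correlation scores in a range where the softmax does not saturate as the channel count grows.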