Abstract: With the increased importance of autonomous navigation systems has come an increasing need to protect the safety of Vulnerable Road Users (VRUs) such as pedestrians. Predicting pedestrian intent is one such challenging task, where prior work predicts the binary cross/no-cross intention with a fusion of visual and motion features. However, there has been no effort so far to hedge such predictions with human-understandable reasons. We address this issue by introducing a novel problem setting of exploring the intuitive reasoning behind a pedestrian's intent. In particular, we show that predicting the 'WHY' can be very useful in understanding the 'WHAT'. To this end, we propose a novel, reason-enriched PIE++ dataset consisting of multi-label textual explanations/reasons for pedestrian intent. We also introduce a novel multi-task learning framework called MINDREAD, which leverages a cross-modal representation learning framework for predicting pedestrian intent as well as the reason behind the intent. Our comprehensive experiments show significant improvements of 5.6% in accuracy and 7% in F1-score for the task of intent prediction on the PIE++ dataset using MINDREAD. We also achieve a 4.4% improvement in accuracy on the commonly used JAAD dataset. Extensive evaluation using quantitative/qualitative metrics and user studies shows the effectiveness of our approach.

What problem does this paper attempt to address?

The problem this paper attempts to solve is: **how to improve pedestrian intention prediction by introducing interpretable reasoning, in order to better protect vulnerable road users (VRUs) such as pedestrians**. Existing pedestrian intention prediction methods mainly rely on fusing visual and motion features to predict whether a pedestrian will cross the road (a binary classification problem: cross / not cross). However, these methods do not provide human-understandable reasons for their predictions. The paper proposes a new problem setting, exploring the intuitive reasoning behind pedestrian intentions, and shows that predicting the "WHY" helps in understanding the "WHAT", thereby improving the accuracy of pedestrian intention prediction.

### Main Contributions

1. **Introducing Pedestrian Intention Reasoning**: A multi-task learning framework named MINDREAD is proposed. It combines cross-modal representation learning with visual and language modules to predict pedestrian intentions and the reasons behind them simultaneously.
2. **Creating the PIE++ Dataset**: The existing PIE dataset is enriched with multi-label textual explanation/reason annotations, so that it contains not only the behavior and intention information of pedestrians but also the explanatory reasons behind them.
3. **Proposing a New Cross-Modal Learning Framework, MINDREAD**: Semantic correlation and attention mechanisms are used to fuse visual spatio-temporal features with text explanation embeddings, capturing pedestrian intentions and improving prediction performance.

### Experimental Results

- On the PIE++ dataset, MINDREAD improves accuracy by 5.6% and F1-score by 7% on the intention prediction task.
- On the commonly used JAAD dataset, MINDREAD also achieves a 4.4% improvement in accuracy.
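The two metrics quoted in the results, accuracy and F1-score, can be computed for a binary cross / not-cross task as in the minimal sketch below. The labels are made-up toy values, the function name `accuracy_and_f1` is hypothetical, and treating "cross" as the positive class (label 1) is an assumption; none of this comes from the paper's evaluation code.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, F1) for binary labels; `positive` marks the 'cross' class (assumed)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, f1


# Toy ground truth and predictions (1 = cross, 0 = not cross); illustrative only.
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
acc, f1 = accuracy_and_f1(y_true, y_pred)
print(f"accuracy={acc:.3f}, f1={f1:.3f}")
```

F1 is usually reported alongside accuracy when the two classes are imbalanced, since accuracy alone can look good for a model that mostly predicts the majority class.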
### Method Overview

The MINDREAD framework consists of three modules:

1. **Correlated Semantic Explanation Affinity (CSEA)**: Builds a directed graph in which each node is a reason (a text embedding) and the edges represent the reasons' co-occurrence relationships in the dataset. A graph convolutional network (GCN) learns the final reason embeddings.
2. **Transformer-Based Feature Encoding (TFE)**: Extracts and encodes the local and global visual context features in video frames, together with the bounding-box information of pedestrians, using a Swin-V2 Transformer for feature extraction.
3. **Attention-Mechanism-Based Cross-Modal Representation Learning**: Combines the outputs of the TFE and CSEA modules through an attention mechanism to generate the final cross-modal representation used to predict pedestrian intentions and reasons.

In conclusion, this paper significantly improves the accuracy and interpretability of pedestrian intention prediction by introducing explanatory reasons for pedestrian intentions, providing important support for the safety of autonomous driving systems and advanced driver assistance systems (ADAS).
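The CSEA and attention-fusion steps in the method overview can be illustrated with a small, dependency-free sketch. Everything specific here is an assumption made for illustration rather than the paper's implementation: the edge weight from reason i to reason j is taken as the conditional frequency P(j | i), the GCN is a single row-normalized layer with self-loops and ReLU, the fusion is scaled dot-product attention with the visual feature as the query, and all function names and toy values are hypothetical.

```python
import math


def cooccurrence_adjacency(label_sets, num_reasons):
    """Directed edge i -> j weighted by P(reason j | reason i), estimated from annotations."""
    count = [0] * num_reasons
    pair = [[0] * num_reasons for _ in range(num_reasons)]
    for labels in label_sets:
        for i in labels:
            count[i] += 1
            for j in labels:
                if i != j:
                    pair[i][j] += 1
    return [[pair[i][j] / count[i] if count[i] else 0.0 for j in range(num_reasons)]
            for i in range(num_reasons)]


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def gcn_layer(adj, feats, weight):
    """One graph-convolution step: add self-loops, row-normalize, propagate, project, ReLU."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    a_norm = [[v / sum(row) for v in row] for row in a_hat]
    out = matmul(matmul(a_norm, feats), weight)
    return [[max(0.0, v) for v in row] for row in out]


def attention_fuse(visual, reason_embs):
    """Scaled dot-product attention: visual feature is the query, reason embeddings are keys/values."""
    d = len(visual)
    scores = [sum(v * r for v, r in zip(visual, emb)) / math.sqrt(d) for emb in reason_embs]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract max for numerical stability
    weights = [e / sum(exps) for e in exps]
    context = [sum(w * emb[k] for w, emb in zip(weights, reason_embs)) for k in range(d)]
    return weights, visual + context  # concatenate visual feature and attended reason context


# Toy multi-label reason annotations for four clips (reason ids 0..2); illustrative only.
label_sets = [{0, 1}, {0, 2}, {0, 1}, {2}]
adj = cooccurrence_adjacency(label_sets, num_reasons=3)

# Toy 2-d "text embeddings" per reason and an identity projection (assumed values).
reason_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
reason_embs = gcn_layer(adj, reason_feats, weight=[[1.0, 0.0], [0.0, 1.0]])

# Toy visual feature standing in for the TFE output.
weights, fused = attention_fuse([0.5, 1.5], reason_embs)
print("attention over reasons:", [round(w, 3) for w in weights])
print("fused representation length:", len(fused))
```

In a real system the fused vector would feed two heads, one for the cross / not-cross intent and one for the multi-label reasons; the conditional-frequency weighting is what makes the graph directed, since P(j | i) and P(i | j) generally differ.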

Can Reasons Help Improve Pedestrian Intent Estimation? A Cross-Modal Approach

See Extensively While Focusing on the Core Area for Pedestrian Detection

Diving Deeper Into Pedestrian Behavior Understanding: Intention Estimation, Action Prediction, and Event Risk Assessment

MindReaD: Enhancing Pedestrian-Vehicle Interaction with Micro-Level Reasoning Data Annotation

Coupling Intent and Action for Pedestrian Crossing Behavior Prediction

Pedestrian Intention Prediction for Autonomous Vehicles: A Comprehensive Survey

Bifold and Semantic Reasoning for Pedestrian Behavior Prediction

Feature Importance in Pedestrian Intention Prediction: A Context-Aware Review

Experimental Insights Towards Explainable and Interpretable Pedestrian Crossing Prediction

Context-aware Multi-task Learning for Pedestrian Intent and Trajectory Prediction

Local and Global Contextual Features Fusion for Pedestrian Intention Prediction

Pedestrian Intention Prediction: A Multi-task Perspective

Action-ViT: Pedestrian Intent Prediction in Traffic Scenes

Intention Recognition of Pedestrians and Cyclists by 2D Pose Estimation

Crossmodal Transformer Based Generative Framework for Pedestrian Trajectory Prediction

A low complexity contextual stacked ensemble-learning approach for pedestrian intent prediction

Context Model for Pedestrian Intention Prediction using Factored Latent-Dynamic Conditional Random Fields

Pedestrian Crossing Intention Forecasting at Unsignalized Intersections Using Naturalistic Trajectories

PIP-Net: Pedestrian Intention Prediction in the Wild

PedFormer: Pedestrian Behavior Prediction via Cross-Modal Attention Modulation and Gated Multitask Learning

Applying the Extended Theory of Planned Behavior to Pedestrian Intention Estimation