Can Reasons Help Improve Pedestrian Intent Estimation? A Cross-Modal Approach

Vaishnavi Khindkar, Vineeth Balasubramanian, Chetan Arora, Anbumani Subramanian, C.V. Jawahar
2024-11-20
Abstract: With the increased importance of autonomous navigation systems has come an increasing need to protect the safety of Vulnerable Road Users (VRUs) such as pedestrians. Predicting pedestrian intent is one such challenging task, where prior work predicts the binary cross/no-cross intention with a fusion of visual and motion features. However, there has been no effort so far to hedge such predictions with human-understandable reasons. We address this issue by introducing a novel problem setting of exploring the intuitive reasoning behind a pedestrian's intent. In particular, we show that predicting the 'WHY' can be very useful in understanding the 'WHAT'. To this end, we propose a novel, reason-enriched PIE++ dataset consisting of multi-label textual explanations/reasons for pedestrian intent. We also introduce a novel multi-task learning framework called MINDREAD, which leverages a cross-modal representation learning framework for predicting pedestrian intent as well as the reason behind the intent. Our comprehensive experiments show significant improvement of 5.6% and 7% in accuracy and F1-score for the task of intent prediction on the PIE++ dataset using MINDREAD. We also achieved a 4.4% improvement in accuracy on the commonly used JAAD dataset. Extensive evaluation using quantitative/qualitative metrics and user studies shows the effectiveness of our approach.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper asks **how to improve pedestrian intention prediction by introducing interpretable reasoning, with the goal of better protecting vulnerable road users (VRUs) such as pedestrians**. Existing methods mainly fuse visual and motion features to predict whether a pedestrian will cross the road (a binary cross / no-cross classification), but they do not provide human-understandable reasons for their predictions. The paper proposes a new problem setting that explores the intuitive reasoning behind pedestrian intentions, and shows that predicting the "WHY" helps in understanding the "WHAT", which in turn improves the accuracy of intention prediction.

### Main Contributions

1. **Introducing Pedestrian Intention Reasoning**:
   - A multi-task learning framework named MINDREAD is proposed. It combines cross-modal representation learning with visual and language modules to predict pedestrian intentions and the reasons behind them simultaneously.
2. **Creating the PIE++ Dataset**:
   - The existing PIE dataset is enriched with multi-label textual explanation / reason annotations, so that it contains not only the behavior and intention of pedestrians but also the reasons that explain them.
3. **Proposing a New Cross-Modal Learning Framework, MINDREAD**:
   - Semantic correlation and attention mechanisms are used to fuse visual spatio-temporal features with text explanation embeddings, capturing pedestrian intentions and improving prediction performance.

### Experimental Results

- On the PIE++ dataset, MINDREAD improves accuracy by 5.6% and F1-score by 7% on the intention prediction task.
- On the commonly used JAAD dataset, MINDREAD achieves a 4.4% improvement in accuracy.
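To make the multi-task setting above concrete, here is a minimal sketch of a joint objective that scores a binary cross / no-cross prediction together with a multi-label reason prediction. The summary does not state which loss MINDREAD uses, so the weighted sum of binary cross-entropies, the function names `bce_with_logits` and `multitask_loss`, and the `reason_weight` parameter are all assumptions made for illustration.

```python
import math


def bce_with_logits(logit, label):
    """Numerically stable binary cross-entropy for one logit and a 0/1 label."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def multitask_loss(intent_logit, intent_label, reason_logits, reason_labels,
                   reason_weight=1.0):
    """Assumed joint loss: intent BCE plus weighted mean BCE over reason labels."""
    # Intent is a single binary decision: cross (1) or no-cross (0).
    intent_loss = bce_with_logits(intent_logit, intent_label)
    # Reasons are multi-label: each reason gets its own independent 0/1 target.
    reason_loss = sum(
        bce_with_logits(z, y) for z, y in zip(reason_logits, reason_labels)
    ) / len(reason_logits)
    return intent_loss + reason_weight * reason_loss


# Hypothetical example: one pedestrian, three candidate reasons.
loss_uninformed = multitask_loss(0.0, 1, [0.0, 0.0, 0.0], [1, 0, 1])
loss_confident = multitask_loss(4.0, 1, [4.0, -4.0, 4.0], [1, 0, 1])
print(loss_uninformed, loss_confident)
```

With all logits at zero, each term equals ln 2, so the uninformed loss is 2 ln 2. Confident, correct logits drive both terms toward zero, which is the sense in which the reason task can act as an auxiliary training signal for the intent task.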
### Method Overview

The MINDREAD framework consists of three modules:

1. **Correlated Semantic Explanation Affinity (CSEA)**:
   - A directed graph is built in which each node is a reason (represented by its text embedding) and each edge encodes how the two reasons co-occur in the dataset. A GCN then learns the final reason embeddings.
2. **Transformer-Based Feature Encoding (TFE)**:
   - Local and global visual context features are extracted from video frames and encoded together with the pedestrian bounding-box information. A Swin-V2 Transformer is used for feature extraction.
3. **Attention-Mechanism-Based Cross-Modal Representation Learning**:
   - The outputs of the TFE and CSEA modules are combined through an attention mechanism to produce the final cross-modal representation, which is used to predict pedestrian intentions and reasons.

In conclusion, by introducing explanatory reasons for pedestrian intentions, the paper improves both the accuracy and the interpretability of pedestrian intention prediction, which supports the safety of autonomous driving systems and advanced driver assistance systems (ADAS).
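The CSEA and attention-fusion steps in the Method Overview can be illustrated with a small, dependency-free sketch. It is not the authors' implementation. The edge weight (the conditional frequency of reason j given reason i), the row normalization with self-loops, the single GCN layer with identity weights, and the single-query dot-product attention are all assumptions chosen to keep the example short. One-hot vectors stand in for the text embeddings, and a hand-written query stands in for a TFE visual feature.

```python
import math


def cooccurrence_adjacency(samples, num_reasons):
    """Directed edge i -> j weighted by count(i and j) / count(i)."""
    count = [0] * num_reasons
    pair = [[0] * num_reasons for _ in range(num_reasons)]
    for labels in samples:
        present = sorted(set(labels))
        for i in present:
            count[i] += 1
            for j in present:
                if i != j:
                    pair[i][j] += 1
    return [[pair[i][j] / count[i] if count[i] else 0.0
             for j in range(num_reasons)] for i in range(num_reasons)]


def normalize_with_self_loops(adj):
    """Add a self-loop to every node, then scale each row to sum to 1."""
    n = len(adj)
    out = []
    for i in range(n):
        row = [adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
        total = sum(row)
        out.append([v / total for v in row])
    return out


def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gcn_layer(adj_norm, feats, weight):
    """One graph-convolution step: ReLU(A_norm @ H @ W)."""
    hidden = matmul(matmul(adj_norm, feats), weight)
    return [[max(0.0, v) for v in row] for row in hidden]


def attend(query, keys):
    """Scaled dot-product attention of one query over the reason embeddings."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    fused = [sum(w * key[d] for w, key in zip(weights, keys))
             for d in range(len(query))]
    return weights, fused


# Hypothetical multi-label annotations: each sample lists its reason ids.
samples = [[0, 1], [0, 1], [0, 2], [1]]
adj = cooccurrence_adjacency(samples, 3)
adj_norm = normalize_with_self_loops(adj)

identity = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
reason_emb = gcn_layer(adj_norm, identity, identity)  # stand-in embeddings

visual_query = [1.0, 0.0, 0.0]  # stand-in for a TFE visual feature
weights, fused = attend(visual_query, reason_emb)
print(adj, weights, fused)
```

The graph is directed because the conditional frequencies are asymmetric: in the toy data, reason 2 always appears with reason 0, but reason 0 appears with reason 2 only a third of the time. A real model would replace the identity matrices with learned text embeddings and trainable GCN weights.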