Abstract: Pedestrian action prediction is of great significance for many applications such as autonomous driving. However, state-of-the-art methods lack the explainability needed to make trustworthy predictions. In this paper, a novel framework called MulCPred is proposed that explains its predictions based on multi-modal concepts represented by training samples. Previous concept-based methods have several limitations: 1) they cannot be applied directly to multi-modal cases; 2) they lack the locality to attend to details in the inputs; 3) they suffer from mode collapse. These limitations are tackled through the following approaches: 1) a linear aggregator that integrates the activation results of the concepts into predictions, which associates concepts of different modalities and provides ante-hoc explanations of the relevance between the concepts and the predictions; 2) a channel-wise recalibration module that attends to local spatiotemporal regions, which gives the concepts locality; 3) a feature regularization loss that encourages the concepts to learn diverse patterns. MulCPred is evaluated on multiple datasets and tasks. Both qualitative and quantitative results demonstrate that MulCPred is promising in improving the explainability of pedestrian action prediction without obvious performance degradation. Furthermore, removing unrecognizable concepts from MulCPred improves cross-dataset prediction performance, indicating that the generalizability of MulCPred can be improved further.

What problem does this paper attempt to address?

### Problems Addressed by the Paper

This paper addresses the lack of explainability in pedestrian action prediction. Existing prediction methods have made significant progress in performance, but they cannot explain their predictions, which makes it difficult for users to trust them. The paper proposes a framework called MulCPred, which predicts pedestrian actions from multi-modal concepts and explains each prediction in terms of those concepts.

### Main Issues and Challenges

1. **Multi-modal Inputs**: Existing concept-based methods cannot be applied directly to multi-modal cases, so they cannot relate concepts drawn from different modalities.
2. **Lack of Locality**: Existing methods explain predictions only at the sample level and cannot attend to local spatiotemporal details in the inputs.
3. **Mode Collapse**: Existing methods can suffer from mode collapse, where the learned concepts converge on the same few patterns, which undermines explainability.

### Solutions

To overcome these issues, MulCPred proposes the following:

1. **Linear Aggregator**: Integrates the activation results of the concepts into predictions, associating concepts from different modalities and providing ante-hoc explanations of how relevant each concept is to the prediction.
2. **Channel-wise Recalibration Module**: Attends to local spatiotemporal regions in the inputs, giving the concepts locality.
3. **Feature Regularization Loss**: Encourages the concepts to learn diverse patterns, which counters mode collapse.

### Experimental Results

MulCPred was evaluated on multiple datasets and tasks, including pedestrian crossing prediction and atomic action prediction. Both qualitative and quantitative results indicate that MulCPred improves the explainability of predictions without obvious performance degradation. Additionally, removing unrecognizable concepts improves MulCPred's cross-dataset prediction performance, which suggests the framework can be generalized further.

### Conclusion

By introducing multi-modal concepts together with mechanisms for locality and concept diversity, MulCPred makes pedestrian action prediction more explainable, supporting more trustworthy predictions for applications such as autonomous driving.
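To make the linear aggregator and the feature regularization loss described above more concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the function names (`aggregate`, `contributions`, `diversity_penalty`), the example numbers, and the choice of a mean squared pairwise cosine similarity as the diversity penalty are all assumptions made for illustration, since the text above does not give the exact formulas.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def aggregate(activations: Vector, weights: Sequence[Vector], bias: Vector) -> List[float]:
    """Linear aggregator: logit[k] = sum_j weights[k][j] * activations[j] + bias[k].

    `activations` holds one score per concept, concatenated across modalities.
    """
    return [sum(w * a for w, a in zip(row, activations)) + b
            for row, b in zip(weights, bias)]


def contributions(activations: Vector, weights: Sequence[Vector], cls: int) -> List[float]:
    """Per-concept contribution to one class logit.

    Because the aggregator is linear, these terms sum to the logit minus its bias,
    which is what allows an explanation to be read off before any post-hoc analysis.
    """
    return [w * a for w, a in zip(weights[cls], activations)]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm > 0 else 0.0


def diversity_penalty(concepts: Sequence[Vector]) -> float:
    """Assumed stand-in for a feature regularization loss.

    Returns the mean squared cosine similarity over all concept pairs, so the
    value is low when concepts point in different directions and high when
    they collapse onto the same pattern.
    """
    n = len(concepts)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    if not pairs:
        return 0.0
    return sum(cosine(concepts[i], concepts[j]) ** 2 for i, j in pairs) / len(pairs)


if __name__ == "__main__":
    # Hypothetical activations: two visual concepts followed by one trajectory concept.
    acts = [0.9, 0.1, 0.4]
    # Hypothetical weights for two classes (rows) over three concepts (columns).
    W = [[1.0, -0.5, 2.0],
         [-1.0, 0.5, 0.0]]
    b = [0.0, 0.1]
    print("logits:", aggregate(acts, W, b))
    print("class-0 contributions:", contributions(acts, W, 0))
    print("penalty, distinct concepts:", diversity_penalty([[1.0, 0.0], [0.0, 1.0]]))
    print("penalty, collapsed concepts:", diversity_penalty([[1.0, 0.0], [2.0, 0.0]]))
```

In this sketch, the per-concept contributions are what would be shown to a user as the explanation, and the penalty would be added to the training loss with some weighting factor; both the weighting and the vector representation of a concept are left open here.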

MulCPred: Learning Multi-modal Concepts for Explainable Pedestrian Action Prediction
