MulCPred: Learning Multi-modal Concepts for Explainable Pedestrian Action Prediction

Yan Feng, Alexander Carballo, Keisuke Fujii, Robin Karlsson, Ming Ding, Kazuya Takeda
2024-09-14
Abstract: Pedestrian action prediction is of great significance for many applications such as autonomous driving. However, state-of-the-art methods lack explainability to make trustworthy predictions. In this paper, a novel framework called MulCPred is proposed that explains its predictions based on multi-modal concepts represented by training samples. Previous concept-based methods have limitations including: 1) they cannot directly apply to multi-modal cases; 2) they lack locality to attend to details in the inputs; 3) they suffer from mode collapse. These limitations are tackled accordingly through the following approaches: 1) a linear aggregator to integrate the activation results of the concepts into predictions, which associates concepts of different modalities and provides ante-hoc explanations of the relevance between the concepts and the predictions; 2) a channel-wise recalibration module that attends to local spatiotemporal regions, which enables the concepts with locality; 3) a feature regularization loss that encourages the concepts to learn diverse patterns. MulCPred is evaluated on multiple datasets and tasks. Both qualitative and quantitative results demonstrate that MulCPred is promising in improving the explainability of pedestrian action prediction without obvious performance degradation. Furthermore, by removing unrecognizable concepts from MulCPred, the cross-dataset prediction performance is improved, indicating the feasibility of further generalizability of MulCPred.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
### Problems Addressed by the Paper

This paper addresses the lack of explainability in pedestrian action prediction. Existing methods have made significant progress in predictive performance, but they offer no account of why a prediction was made, which makes it hard for users to trust them. The paper proposes a framework called MulCPred, which predicts pedestrian actions from multi-modal concepts represented by training samples and explains each prediction in terms of those concepts.

### Main Issues and Challenges

1. **Multi-modal Data Processing**: Previous concept-based methods mainly handle single-modal data, such as images or speech signals, and cannot be applied directly to multi-modal inputs.
2. **Lack of Locality**: Previous methods provide explanations only at the sample level and cannot attend to local spatiotemporal details in the input.
3. **Mode Collapse**: Previous methods sometimes suffer from mode collapse, where the learned concepts converge to nearly the same pattern, which undermines explainability.

### Solutions

MulCPred tackles each limitation with a corresponding component:

1. **Linear Aggregator**: Integrates the activation results of the concepts into predictions. This associates concepts of different modalities and provides ante-hoc explanations of how relevant each concept is to the prediction.
2. **Channel-wise Recalibration Module**: Attends to local spatiotemporal regions of the input, giving the concepts locality.
3. **Feature Regularization Loss**: Adds a loss term that encourages the concepts to learn diverse patterns, countering mode collapse.

### Experimental Results

MulCPred was evaluated on multiple datasets and tasks, including pedestrian crossing prediction and atomic action prediction. Both qualitative and quantitative results indicate that MulCPred improves the explainability of predictions without obvious performance degradation. In addition, removing unrecognizable concepts from MulCPred improves its cross-dataset prediction performance, which suggests the framework can be generalized further.

### Conclusion

By learning multi-modal concepts and explaining predictions through them, MulCPred makes pedestrian action prediction more explainable and therefore more trustworthy for applications such as autonomous driving.
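To make the linear aggregator and the concept-removal experiment more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. The `Concept` class, the function names, and the modality labels are hypothetical. The cosine-similarity penalty is one common way to encourage diverse concepts and is only an assumed stand-in for the paper's feature regularization loss, whose exact form is not given here.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Set, Tuple


@dataclass(frozen=True)
class Concept:
    name: str      # hypothetical human-readable label
    modality: str  # hypothetical modality tag, e.g. "appearance"


def linear_aggregate(activations: Sequence[float],
                     weights: Sequence[Sequence[float]],
                     bias: Sequence[float]) -> List[float]:
    """Class logits as a weighted sum: logit[c] = bias[c] + sum_k weights[c][k] * activations[k]."""
    return [b + sum(w * a for w, a in zip(row, activations))
            for row, b in zip(weights, bias)]


def explain(concepts: Sequence[Concept],
            activations: Sequence[float],
            weights: Sequence[Sequence[float]],
            class_index: int,
            top_k: int = 3) -> List[Tuple[str, str, float]]:
    """Rank concepts by the magnitude of their contribution to one class logit.

    Because the aggregator is linear, weight * activation is exactly the
    share of the logit owed to each concept (an ante-hoc explanation).
    """
    row = weights[class_index]
    contributions = [(c.name, c.modality, w * a)
                     for c, w, a in zip(concepts, row, activations)]
    contributions.sort(key=lambda item: abs(item[2]), reverse=True)
    return contributions[:top_k]


def drop_concepts(weights: Sequence[Sequence[float]],
                  removed: Set[int]) -> List[List[float]]:
    """Return a copy of the weights with the removed concepts' columns zeroed."""
    return [[0.0 if k in removed else w for k, w in enumerate(row)]
            for row in weights]


def diversity_penalty(concept_vectors: Sequence[Sequence[float]]) -> float:
    """Assumed regularizer: mean squared cosine similarity over concept pairs."""
    def cosine(u: Sequence[float], v: Sequence[float]) -> float:
        dot = sum(x * y for x, y in zip(u, v))
        norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
        return dot / norm if norm > 0 else 0.0

    n = len(concept_vectors)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    if not pairs:
        return 0.0
    return sum(cosine(concept_vectors[i], concept_vectors[j]) ** 2
               for i, j in pairs) / len(pairs)
```

In this sketch, `explain` exposes which concepts drive a given class score, and `drop_concepts` mimics removing unrecognizable concepts by cancelling their influence on every class without retraining. The penalty is 0 when concept vectors are mutually orthogonal and 1 when they all point in the same direction, so minimizing it pushes the concepts apart.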