A Multi-Modal Explainability Approach for Human-Aware Robots in Multi-Party Conversation

Iveta Bečková, Štefan Pócoš, Giulia Belgiovine, Marco Matarese, Alessandra Sciutti, Carlo Mazzola
2024-05-20
Abstract: The addressee estimation (understanding to whom somebody is talking) is a fundamental task for human activity recognition in multi-party conversation scenarios. Specifically, in the field of human-robot interaction, it becomes even more crucial to enable social robots to participate in such interactive contexts. However, it is usually implemented as a binary classification task, restricting the robot's capability to estimate whether it was addressed and limiting its interactive skills. For a social robot to gain the trust of humans, it is also important to manifest a certain level of transparency and explainability. Explainable artificial intelligence thus plays a significant role in the current machine learning applications and models, to provide explanations for their decisions besides excellent performance. In our work, we a) present an addressee estimation model with improved performance in comparison with the previous SOTA; b) further modify this model to include inherently explainable attention-based segments; c) implement the explainable addressee estimation as part of a modular cognitive architecture for multi-party conversation in an iCub robot; d) propose several ways to incorporate explainability and transparency in the aforementioned architecture; and e) perform a pilot user study to analyze the effect of various explanations on how human participants perceive the robot.
Artificial Intelligence, Computation and Language, Human-Computer Interaction, Machine Learning, Robotics, Image and Video Processing
What problem does this paper attempt to address?
The paper asks how a social robot in a multi-party conversation can identify more accurately whom a speaker is addressing (addressee estimation), and how it can do so in an interpretable and transparent manner. Specifically, the authors propose a multi-modal explainability approach that aims at:

1. **Improving the performance of the addressee estimation model**: Improve on the previous state-of-the-art addressee estimation model to increase its accuracy.
2. **Introducing an interpretable attention mechanism**: Add attention-based components to the model so that the robot can explain its decision-making process as well as make decisions.
3. **Implementing a modular cognitive architecture**: Integrate the improved model into a modular cognitive architecture, enabling the iCub robot to participate in multi-party conversations and explain its behavior in real time.
4. **Evaluating user perception**: Analyze, through a pilot user study, how different explanation modalities (such as verbal, body-movement, and visual explanations) affect human participants' perception of the robot.

### Background of Key Issues

- **Importance of addressee estimation**: In human-robot interaction, knowing whom a speaker is addressing is crucial for smooth conversation. Most existing models perform only binary classification (whether or not the robot is being addressed) and cannot identify which other participant is the addressee, which limits the use of robots in complex conversation scenarios.
- **Requirement for interpretability and transparency**: To build human trust, a robot needs to explain its behaviors and decisions in addition to performing well. Transparency and interpretability matter particularly in tasks that involve social interaction.

### Method Overview

The paper proposes two main steps to solve the above problems:

1. **Design an improved addressee estimation model (IAE model)**:
   - Train and optimize the model on the Vernissage dataset, improving classification accuracy through hyperparameter search and cross-validation.
   - Reduce the number of model parameters from approximately 91.7 million to 678,000 while maintaining or improving performance.
2. **Introduce an interpretable attention mechanism (XAE model)**:
   - Use vision transformers and MLPs to process facial and pose information, respectively.
   - Calculate an importance score for each time frame with a custom attention mechanism, producing interpretable intermediate representations.
   - Provide three types of explanations: image saliency maps, a comparison of the importance of facial and pose information, and importance scores of time frames.

### User Evaluation

Finally, the authors deploy the XAE model in a modular cognitive architecture and conduct a user study to evaluate how different explanation methods affect user perception. Together, these contributions make robots more practical participants in multi-party conversation scenarios. The paper improves the accuracy of addressee estimation and gives robots a more transparent and interpretable way to interact, with the aim of strengthening trust between humans and robots.
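
The per-time-frame importance scores mentioned in the method overview can be illustrated with a small sketch. This is not the authors' attention mechanism, which the summary only describes as custom; it is a generic softmax attention pooling over a sequence of frame embeddings, written in plain Python. The function names, the `query` scoring vector, the embedding size, and the random input features are all hypothetical and chosen only for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def frame_attention(frame_features, query):
    """Pool per-frame embeddings and return one weight per frame.

    frame_features: list of T embeddings, each a list of D floats
                    (stand-ins for fused face and pose features).
    query:          list of D floats; learned in a real model,
                    hypothetical here.
    """
    dim = len(query)
    # Scaled dot-product score for each time frame.
    scores = [
        sum(f * q for f, q in zip(frame, query)) / math.sqrt(dim)
        for frame in frame_features
    ]
    # Non-negative weights that sum to 1: readable as frame importance.
    weights = softmax(scores)
    # Weighted sum of frame embeddings, one value per feature dimension.
    pooled = [
        sum(w * frame[d] for w, frame in zip(weights, frame_features))
        for d in range(dim)
    ]
    return pooled, weights


# Illustrative data only: 10 frames with 16-dimensional random embeddings.
random.seed(0)
NUM_FRAMES, DIM = 10, 16
features = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_FRAMES)]
query = [random.gauss(0, 1) for _ in range(DIM)]

pooled, weights = frame_attention(features, query)
top_frame = max(range(NUM_FRAMES), key=lambda t: weights[t])
print("frame importance scores:", [round(w, 3) for w in weights])
print("most influential frame index:", top_frame)
```

Because the weights are non-negative and sum to 1, they can be shown to a user directly as the relative importance of each frame. That is the sense in which an attention layer of this kind yields an explanation as a by-product of the prediction.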