Abstract: The addressee estimation (understanding to whom somebody is talking) is a fundamental task for human activity recognition in multi-party conversation scenarios. Specifically, in the field of human-robot interaction, it becomes even more crucial to enable social robots to participate in such interactive contexts. However, it is usually implemented as a binary classification task, restricting the robot's capability to estimate whether it was addressed and limiting its interactive skills. For a social robot to gain the trust of humans, it is also important to manifest a certain level of transparency and explainability. Explainable artificial intelligence thus plays a significant role in the current machine learning applications and models, to provide explanations for their decisions besides excellent performance. In our work, we a) present an addressee estimation model with improved performance in comparison with the previous SOTA; b) further modify this model to include inherently explainable attention-based segments; c) implement the explainable addressee estimation as part of a modular cognitive architecture for multi-party conversation in an iCub robot; d) propose several ways to incorporate explainability and transparency in the aforementioned architecture; and e) perform a pilot user study to analyze the effect of various explanations on how human participants perceive the robot.

What problem does this paper attempt to address?

The paper addresses the following problem: in multi-party conversation scenarios, how can a social robot identify more accurately whom the speaker is addressing (addressee estimation), and do so in an interpretable and transparent manner? Specifically, the authors propose a multimodal explainability approach with four aims:

1. **Improving the performance of the addressee estimation model**: Improve on the existing state-of-the-art addressee estimation model to raise its accuracy.
2. **Introducing an interpretable attention mechanism**: By adding an attention mechanism to the model, the robot can not only make decisions but also explain its decision-making process.
3. **Implementing a modular cognitive architecture**: Integrate the improved model into a modular cognitive architecture, enabling the iCub robot to participate in multi-party conversations and explain its behavior in real time.
4. **Evaluating user acceptance**: Analyze, through a user study, how different explanation modalities (such as speech, body movements, and visuals) affect human participants' perception of the robot.

### Background of Key Issues

- **Importance of addressee estimation**: In human-robot interaction, understanding whom the speaker is addressing is crucial for smooth conversation. Most existing models perform only binary classification (whether or not the robot is being addressed) and cannot identify the specific addressee, which limits the use of robots in complex conversation scenarios.
- **Requirement for interpretability and transparency**: To build human trust, robots need not only to perform well but also to explain their behaviors and decisions. This matters especially in tasks involving social interaction.

### Method Overview

The paper proposes two main steps to solve the above problems:

1. **Designing an improved addressee estimation model (IAE model)**:
   - Use the Vernissage dataset to train and optimize the model, and improve classification accuracy through hyperparameter search and cross-validation.
   - Reduce the number of model parameters from approximately 91.7 million to 678,000 while maintaining or improving performance.
2. **Introducing an interpretable attention mechanism (XAE model)**:
   - Use vision transformers and MLPs to process facial and pose information, respectively.
   - Calculate an importance score for each time frame through a custom attention mechanism, thereby generating interpretable intermediate representations.
   - Provide three types of explanations: image saliency maps, a comparison of the importance of facial and pose information, and importance scores of time frames.

### User Evaluation

Finally, the authors deploy the XAE model in a modular cognitive architecture and conduct a user study to evaluate the impact of different explanation methods on user perception. Together, these contributions advance the practical use of robots in multi-party conversation scenarios.

With this approach, the paper not only improves the accuracy of addressee estimation but also gives robots a more transparent and interpretable way of interacting, which is intended to strengthen trust and interaction quality between humans and robots.
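
The idea of assigning an importance score to each time frame can be illustrated with a generic attention-pooling sketch. This is not the paper's custom mechanism, whose exact form is not described here. The dot-product scoring, the `query` vector, and the toy feature values are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def frame_importance(frame_features, query):
    """Score each frame by its dot product with a query vector.

    `query` stands in for a learned parameter (an assumption); in a
    trained model it would be optimized together with the classifier.
    """
    scores = [sum(f * q for f, q in zip(frame, query))
              for frame in frame_features]
    return softmax(scores)


def attention_pool(frame_features, weights):
    """Weighted average of frame features using the importance weights."""
    dim = len(frame_features[0])
    return [sum(w * frame[d] for w, frame in zip(weights, frame_features))
            for d in range(dim)]


# Toy sequence of three frames with 2-dimensional fused features
# (hypothetical values, not data from the paper).
frames = [[0.1, 0.0], [0.9, 0.2], [0.2, 0.1]]
query = [2.0, 1.0]

weights = frame_importance(frames, query)
pooled = attention_pool(frames, weights)
most_important = max(range(len(weights)), key=lambda i: weights[i])
```

In this toy input the second frame has the largest dot product with `query`, so it receives the highest weight. Because the weights are positive and sum to 1, they can be read directly as per-frame importance scores, which is what makes this kind of pooling a natural source of explanations.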

A Multi-Modal Explainability Approach for Human-Aware Robots in Multi-Party Conversation

Analysing Explanation-Related Interactions in Collaborative Perception-Cognition-Communication-Action

Explain yourself! Effects of Explanations in Human-Robot Interaction

A User-Centred Framework for Explainable Artificial Intelligence in Human-Robot Interaction

To Whom are You Talking? A Deep Learning Model to Endow Social Robots with Addressee Estimation Skills

Explainable Representations of the Social State: A Model for Social Human-Robot Interactions

A Surrogate Model Framework for Explainable Autonomous Behaviour

Multi-Agent Strategy Explanations for Human-Robot Collaboration

Self-Explaining Social Robots: An Explainable Behavior Generation Architecture for Human-Robot Interaction

A Tale of Two Explanations: Enhancing Human Trust by Explaining Robot Behavior.

Let people fail! Exploring the influence of explainable virtual and robotic agents in learning-by-doing tasks

Joint Mind Modeling for Explanation Generation in Complex Human-Robot Collaborative Tasks

CE-MRS: Contrastive Explanations for Multi-Robot Systems

Tell and show: Combining multiple modalities to communicate manipulation tasks to a robot

Enhancing Human–Robot Collaboration through a Multi-Module Interaction Framework with Sensor Fusion: Object Recognition, Verbal Communication, User of Interest Detection, Gesture and Gaze Recognition

Explainable Activity Recognition for Smart Home Systems

SMArT: Training Shallow Memory-aware Transformers for Robotic Explainability

Explainable AI As Collaborative Task Solving.

From Pixels to Words: Leveraging Explainability in Face Recognition through Interactive Natural Language Processing

Interactive Plan Explicability in Human-Robot Teaming

UGotMe: An Embodied System for Affective Human-Robot Interaction