To Whom are You Talking? A Deep Learning Model to Endow Social Robots with Addressee Estimation Skills

Carlo Mazzola, Marta Romeo, Francesco Rea, Alessandra Sciutti, Angelo Cangelosi
DOI: https://doi.org/10.1109/IJCNN54540.2023.10191452
2024-03-28
Abstract: Communicating shapes our social world. For a robot to be considered social and consequently be integrated into our social environment, it is fundamental that it understand some of the dynamics that rule human-human communication. In this work, we tackle the problem of Addressee Estimation, the ability to understand an utterance's addressee, by interpreting and exploiting non-verbal bodily cues from the speaker. We do so by implementing a hybrid deep learning model composed of convolutional layers and LSTM cells, taking as input images portraying the face of the speaker and 2D vectors of the speaker's body posture. Our implementation choices were guided by the aim to develop a model that could be deployed on social robots and be efficient in ecological scenarios. We demonstrate that our model is able to solve the Addressee Estimation problem in terms of addressee localisation in space, from a robot ego-centric point of view.
Machine Learning, Artificial Intelligence, Robotics
What problem does this paper attempt to address?
This paper addresses the problem of developing a deep learning model that gives social robots the ability to perform addressee estimation. Specifically, the model aims to determine whom the speaker is talking to in multi-party interaction scenarios by interpreting the speaker's non-verbal bodily cues (such as eye contact and gestures).

### Background and Importance of the Problem

Communication is a key activity in shaping the social world. For robots to integrate into human social environments, they need to understand basic communication dynamics, for example whom a message is intended for. This ability is especially important for social robots in complex scenarios that go beyond simple one-to-one interactions. Understanding the addressee can help robots:

1. **Identify Implicit Commands**: Distinguish which instructions are meant for the robot.
2. **Understand Social Dynamics and Roles**: Identify different social roles in multi-party interactions.
3. **Correctly Interpret Sentence Meanings**: Especially sentences containing personal pronouns (such as "you", "he", "she", "they", etc.).

### Research Objectives

The goal of this paper is to implement an Addressee Estimation model to enhance human-robot interaction (HRI). The model approaches the problem as follows:

- **Locate the Addressee**: From the robot's own perspective, determine the position of the addressee in space.
- **Multi-class Classification**: Assign the position of the addressee to one of three categories (left, right, or the robot itself).
- **Use Only Visual Information**: Rely on the speaker's facial image and body posture vector, without other external devices.
- **Apply in Ecological Scenarios**: Ensure that the model can operate in realistic and natural environments.

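
The objectives above imply a specific input and output shape for the task. The following is a minimal sketch of that shape, not the authors' code: the names `AddresseeLocation`, `FrameInput`, `UtteranceSample`, and `validate_sample`, the integer coding of the classes, the grayscale image layout, and the flattened keypoint order are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class AddresseeLocation(Enum):
    """Three target classes from the robot's point of view (integer coding is assumed)."""
    ROBOT = 0
    LEFT = 1
    RIGHT = 2


@dataclass
class FrameInput:
    """Visual input for one video frame of the speaker (hypothetical layout)."""
    face_image: List[List[float]]  # grayscale face crop as rows of pixel intensities
    pose_vector: List[float]       # flattened 2D keypoints: [x0, y0, x1, y1, ...]


@dataclass
class UtteranceSample:
    """A short sequence of frames from one speech segment, with its class label."""
    frames: List[FrameInput]
    label: AddresseeLocation


def validate_sample(sample: UtteranceSample, n_keypoints: int) -> bool:
    """Check that a sample is a non-empty sequence with consistent frame shapes."""
    if not sample.frames:
        raise ValueError("a sample needs at least one frame")
    for frame in sample.frames:
        if len(frame.pose_vector) != 2 * n_keypoints:
            raise ValueError("pose vector must hold an (x, y) pair per keypoint")
        if not frame.face_image:
            raise ValueError("face image must have at least one row")
        row_lengths = {len(row) for row in frame.face_image}
        if len(row_lengths) != 1:
            raise ValueError("face image rows must all have the same length")
    return True
```

Keeping the per-frame image and pose vector side by side reflects the two input modalities named in the objectives; a sequence model would then consume `frames` in temporal order.
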
### Method Overview

The researchers designed a hybrid deep neural network (CNN + LSTM) that processes two input modalities: the speaker's facial image and body posture vector. Convolutional layers extract features, and LSTM layers capture temporal patterns across the sequence, which together produce the addressee estimate.

### Dataset and Experiments

This research used the Vernissage Corpus dataset, which records multi-party interactions between two human participants and the Nao robot. Through a series of pre-processing steps, including segmenting speech segments, extracting body postures and facial images, and data augmentation, the researchers constructed a dataset for training and testing the model.

### Experimental Results

The researchers evaluated the model with 10-fold cross-validation. The intermediate fusion model performed best in terms of the weighted F1 score, achieving an accuracy of approximately 75%. Further experiments were carried out to verify the importance of the different input modalities and to compare against binary classification tasks.

In conclusion, this paper aims to enable social robots to accurately estimate the addressee in complex social scenarios through deep learning techniques, thereby improving their performance in human-robot interaction.
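
The evaluation protocol mentioned above (10-fold cross-validation scored with a weighted F1) can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the authors' evaluation code: `weighted_f1` and `k_fold_indices` are hypothetical helper names, the label strings are placeholders for the three classes, and the contiguous fold split is an assumption, since the summary does not say how the folds were constructed.

```python
from collections import Counter

# Placeholder label names for the three addressee classes (assumed).
LABELS = ("robot", "left", "right")


def weighted_f1(y_true, y_pred, labels=LABELS):
    """Per-class F1 averaged with weights equal to each class's share of y_true."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0  # F1 = 2TP / (2TP + FP + FN)
        score += f1 * support[label] / total
    return score


def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) for k contiguous folds (split strategy assumed)."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size
```

In use, a model would be trained on each `train_idx`, its predictions on `test_idx` scored with `weighted_f1`, and the ten scores averaged. Weighting by class support matters when the three classes are not equally frequent, because a plain average over classes would give rare classes the same influence as common ones.
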