Abstract: Communicating shapes our social world. For a robot to be considered social and consequently be integrated into our social environment, it is fundamental that it understands some of the dynamics that rule human-human communication. In this work, we tackle the problem of Addressee Estimation, the ability to understand an utterance's addressee, by interpreting and exploiting non-verbal bodily cues from the speaker. We do so by implementing a hybrid deep learning model composed of convolutional layers and LSTM cells, taking as input images portraying the face of the speaker and 2D vectors of the speaker's body posture. Our implementation choices were guided by the aim to develop a model that could be deployed on social robots and be efficient in ecological scenarios. We demonstrate that our model is able to solve the Addressee Estimation problem in terms of addressee localisation in space, from a robot ego-centric point of view.

What problem does this paper attempt to address?

This paper aims to develop a deep-learning model that gives social robots the ability to perform addressee estimation. Specifically, the model determines whom the speaker is addressing in multi-person interaction scenarios by interpreting the speaker's non-verbal bodily cues (such as eye contact and gestures).

### Background and Importance of the Problem

In human society, communication is a key activity in shaping the social world. For robots to integrate into the human social environment, they need to understand some basic communication dynamics (for example, who a message is meant for). This ability is especially important for social robots in complex scenarios that go beyond simple one-to-one interactions. Understanding the addressee can help robots:

1. **Identify Implicit Commands**: Distinguish which instructions are for the robot.
2. **Understand Social Dynamics and Roles**: Identify different social roles in multi-party interactions.
3. **Correctly Interpret Sentence Meanings**: Especially sentences containing pronouns (such as "you", "he", "she", "they", etc.).

### Research Objectives

The goal of this paper is to implement an Addressee Estimation model to enhance human-robot interaction (HRI). Specifically, the model addresses the problem in the following ways:

- **Locate the Addressee**: From the robot's own perspective, determine the position of the addressee in space.
- **Multi-class Classification**: Divide the position of the addressee into three categories (left, right, or the robot itself).
- **Use Only Visual Information**: Rely on the speaker's facial image and body posture vector, without other external devices.
- **Apply in Ecological Scenarios**: Ensure that the model can operate in more realistic and natural environments.
### Method Overview

The researchers designed a hybrid deep neural network (CNN + LSTM) that processes two input modalities: the speaker's facial image and body posture vector. Features are extracted by the convolutional layers, and temporal patterns are captured by the LSTM layer, ultimately producing the estimate of the addressee.

### Dataset and Experiments

This research used the Vernissage Corpus dataset, which records multi-party interactions between two human participants and the Nao robot. Through a series of pre-processing steps, including segmenting speech segments, extracting body postures and facial images, and data augmentation, the researchers constructed a dataset for training and testing the model.

### Experimental Results

The researchers evaluated the performance of the model through 10-fold cross-validation. The results showed that the intermediate fusion model performed best in terms of the weighted F1 score, achieving an accuracy rate of approximately 75%. In addition, other experiments were carried out to verify the importance of different input modalities and to compare with binary classification tasks.

In conclusion, this paper aims to enable social robots to accurately estimate the addressee in complex social scenarios through deep-learning techniques, thereby improving their performance in human-robot interaction.
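For readers unfamiliar with the weighted F1 score used to compare the models, the following is a minimal sketch of how such a metric can be computed for the three-class task (left, right, robot). The function name `weighted_f1` and the toy labels are assumptions made for illustration; they are not taken from the paper, and the authors' actual evaluation code may differ.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Return the support-weighted mean of per-class F1 scores.

    Each class contributes its F1 score multiplied by the fraction of
    true samples that belong to that class.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equally long")

    support = Counter(y_true)  # number of true samples per class
    total = len(y_true)
    score = 0.0

    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n / total) * f1

    return score


# Hypothetical toy labels, not data from the Vernissage Corpus.
y_true = ["left", "left", "right", "robot", "robot", "robot"]
y_pred = ["left", "right", "right", "robot", "robot", "left"]
print(round(weighted_f1(y_true, y_pred), 4))
```

Weighting by class support means that frequent classes dominate the score, which is relevant when the addressee classes are imbalanced; an unweighted (macro) average would treat all three classes equally instead.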

To Whom are You Talking? A Deep Learning Model to Endow Social Robots with Addressee Estimation Skills
