Abstract: Social robotics is an emerging area in which autonomous social robots are increasingly present in social spaces. Social robots offer services, perform tasks, and interact with people in these environments, which demands more efficient and complex Human-Robot Interaction (HRI) designs. One strategy to improve HRI is to give robots the capacity to detect the emotions of the people around them, so that they can plan a trajectory, modify their behaviour, and generate an appropriate interaction based on the analysed information. However, in social environments where groups of people are common, new approaches are needed to enable robots to recognise groups and the emotion of each group, which can also be associated with the scene in which the group is participating. Some existing studies focus on detecting group cohesion and recognising group emotions; nevertheless, they do not perform these recognition tasks from a robocentric perspective that considers the sensory capacity of robots. In this context, a system is presented that recognises scenes in terms of groups of people and then detects the global (prevailing) emotion in each scene. The proposed approach to visualise and recognise emotions in typical HRI is based on the face size of the people recognised by the robot during its navigation (face sizes decrease as the robot moves away from a group of people). On each frame of the video stream from the visual sensor, individual emotions are recognised using the Visual Geometry Group (VGG) neural network pre-trained to recognise faces (VGGFace); then, to detect the emotion of the frame, the individual emotions are aggregated with a fusion method; finally, to detect the global (prevailing) emotion of the scene (group of people), the emotions of its constituent frames are also aggregated.
Additionally, this work proposes a strategy for creating image and video datasets to validate the estimation of both scene emotions and individual emotions. Both datasets are generated in a simulated environment based on the Robot Operating System (ROS), from videos captured by robots through their sensory capabilities. Tests are performed in two simulated environments in ROS/Gazebo: a museum and a cafeteria. Results show that the accuracy of individual emotion detection is 99.79%, and that the accuracy of group emotion (scene emotion) detection in each frame is 90.84% in the cafeteria scenario and 89.78% in the museum scenario.
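
The two-level aggregation described in the abstract (individual emotions fused into a frame emotion, then frame emotions fused into a scene emotion) can be illustrated with a minimal Python sketch. The abstract does not specify the fusion method, so the rules below are assumptions made only for illustration: each face votes with a weight equal to its area, and the scene emotion is the most frequent frame emotion. The names `Face`, `frame_emotion`, and `scene_emotion` are hypothetical, and the per-face labels are assumed to come from an upstream classifier such as VGGFace.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional, Tuple

# Hypothetical representation of one detected face:
# (emotion label from the per-face classifier, face area in pixels).
Face = Tuple[str, float]


def frame_emotion(faces: List[Face]) -> Optional[str]:
    """Fuse individual emotions into one frame-level label.

    Assumption: each face votes with a weight equal to its area, so
    larger (closer) faces count more. The fusion method used in the
    paper may differ.
    """
    if not faces:
        return None  # no face detected in this frame
    weights: Dict[str, float] = defaultdict(float)
    for label, area in faces:
        weights[label] += area
    return max(weights, key=weights.get)


def scene_emotion(frames: List[List[Face]]) -> Optional[str]:
    """Fuse frame-level labels into the prevailing scene emotion.

    Assumption: plain majority vote over frames that contain faces.
    """
    labels = [lab for lab in (frame_emotion(f) for f in frames) if lab is not None]
    if not labels:
        return None
    return Counter(labels).most_common(1)[0][0]


if __name__ == "__main__":
    frames = [
        [("happy", 900.0), ("sad", 400.0)],
        [("happy", 100.0), ("sad", 150.0), ("sad", 50.0)],
        [("happy", 600.0)],
        [],  # frame without detected faces is skipped
    ]
    print([frame_emotion(f) for f in frames])
    print(scene_emotion(frames))
```

Weighting by face area is one simple way to reflect the abstract's remark that face sizes shrink as the robot moves away from a group; an unweighted vote or a confidence-weighted vote would fit the same structure.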

Group-Level Emotion Recognition Using a Unimodal Privacy-Safe Non-Individual Approach

Multimodal Group Emotion Recognition In-the-wild Using Privacy-Compliant Features

An Efficient Multimodal Framework for Large Scale Emotion Recognition by Fusing Music and Electrodermal Activity Signals

End-to-End Modeling and Transfer Learning for Audiovisual Emotion Recognition in-the-Wild

Multimodal End-to-End Group Emotion Recognition using Cross-Modal Attention

Group-level Emotion Recognition Based on Faces, Scenes, Skeletons Features

Multi-View Common Space Learning For Emotion Recognition In The Wild

EmotioNet Challenge: Recognition of facial expressions of emotion in the wild

Group-level emotion recognition using transfer learning from face identification

Investigation of Multimodal Features, Classifiers and Fusion Methods for Emotion Recognition

AI in Pursuit of Happiness, Finding Only Sadness: Multi-Modal Facial Emotion Recognition Challenge

Temporal Multimodal Fusion for Video Emotion Classification in the Wild

Multiple Spatio-temporal Feature Learning for Video-based Emotion Recognition in the Wild

Multimodal Utterance-level Affect Analysis using Visual, Audio and Text Features

Non-Volume Preserving-based Fusion to Group-Level Emotion Recognition on Crowd Videos

Audio-Visual Fusion for Emotion Recognition in the Valence-Arousal Space Using Joint Cross-Attention

Group Emotion Detection Based on Social Robot Perception

Towards A Robust Group-level Emotion Recognition via Uncertainty-Aware Learning

Affect Analysis in-the-wild: Valence-Arousal, Expressions, Action Units and a Unified Framework

Unimodal Multi-Task Fusion for Emotional Mimicry Intensity Prediction

Analyzing the Affect of a Group of People Using Multi-modal Framework