Abstract: Over the past decade, wearable computing devices ("smart glasses") have undergone remarkable advancements in sensor technology, design, and processing power, ushering in a new era of opportunity for high-density human behavior data. Equipped with wearable cameras, these glasses offer a unique opportunity to analyze non-verbal behavior in natural settings as individuals interact. Our focus lies in predicting engagement in dyadic interactions by scrutinizing verbal and non-verbal cues, aiming to detect signs of disinterest or confusion. Leveraging such analyses may revolutionize our understanding of human communication, foster more effective collaboration in professional environments, provide better mental health support through empathetic virtual interactions, and enhance accessibility for those with communication barriers. In this work, we collect a dataset featuring 34 participants engaged in casual dyadic conversations, each providing self-reported engagement ratings at the end of each conversation. We introduce a novel fusion strategy using Large Language Models (LLMs) to integrate multiple behavior modalities into a "multimodal transcript" that can be processed by an LLM for behavioral reasoning tasks. Remarkably, this method achieves performance comparable to established fusion techniques even in its preliminary implementation, indicating strong potential for further research and optimization. This fusion method is one of the first to approach "reasoning" about real-world human behavior through a language model. Smart glasses provide us the ability to unobtrusively gather high-density multimodal data on human behavior, paving the way for new approaches to understanding and improving human communication with the potential for important societal benefits. The features and data collected during the studies will be made publicly available to promote further research.

What problem does this paper attempt to address?

The problem this paper attempts to solve is: **How can engagement in natural conversations be predicted from multimodal data?** Specifically, the researchers aim to detect each party's degree of interest or confusion by analyzing verbal and non-verbal cues in the conversation. The ultimate goal is to improve human communication, promote more effective collaboration, provide better mental health support, and enhance accessibility for people with communication barriers.

### Research Background and Problem Description

In recent years, wearable computing devices (such as smart glasses) have made significant progress in sensor technology, design, and processing power, making it possible to collect high-density human behavior data. These smart glasses are equipped with sensors such as video scene cameras, eye-tracking cameras, microphones, and inertial measurement units, which can capture and respond to human behavior in real time. However, accurately measuring and evaluating engagement in conversations remains a major challenge, mainly for three reasons:

1. **Complexity and Subtlety of Human Behavior**: Human behavior is influenced by personal history and cultural background, and it is highly context-dependent and variable.
2. **Multifaceted Nature of Social Interaction**: Engagement is reflected not only in verbal content but also in multiple non-verbal cues such as intonation, facial expressions, and gestures.
3. **Lack of Relevant Data**: Although many two-person interaction datasets are publicly available, most are recorded from a third-party perspective, and relatively little natural conversation data is recorded from a first-person perspective.

### Research Methods and Contributions

To address these problems, the paper makes two main contributions:

1. **Introduction of a New Dataset**: The researchers collected a natural conversation dataset containing 34 participants (19 unique pairs). Each conversation was recorded with Pupil Invisible smart glasses, capturing video, audio, and eye-tracking data, alongside participants' self-reported information (such as demographics, political views, and personality traits).
2. **Multimodal Fusion Method Based on Large Language Models (LLMs)**: The researchers proposed a new fusion strategy that uses an LLM to integrate multiple behavior modalities (such as speech, eye movement, and facial expressions) into a text representation called a multimodal transcript. This lets the LLM answer questionnaires about conversation engagement as a real participant would, thereby predicting the engagement level.

### Method Overview

- **Multimodal Data Collection**: Collect video, audio, eye-tracking, and other data through smart glasses and other sensors.
- **Feature Extraction**: Extract facial expression features from video, determine gaze direction from eye-tracking data, and transcribe conversation content from audio.
- **Multimodal Fusion**: Convert the data from these modalities into text and construct a multimodal conversation record as the input to the LLM.
- **Engagement Prediction**: Use the LLM to predict engagement from the multimodal conversation record and compare the result with traditional fusion methods.

### Results and Prospects

Preliminary results show that the LLM-based multimodal fusion method performs comparably to established fusion techniques, indicating strong potential for further research and optimization. Future work can explore more types of multimodal data and improve the application of LLMs to natural conversations in order to better understand and predict human behavior.

In this way, the researchers hope to develop more intelligent technologies that help people understand each other better, improve the quality of communication, and bring important benefits to society.
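
The multimodal fusion and engagement prediction steps in the Method Overview can be illustrated with a minimal sketch. This is not the authors' implementation: the `Event` type, the modality labels, the transcript layout, and the prompt wording are all hypothetical, chosen only to show how time-stamped speech, gaze, and facial-expression features might be merged into one text record for an LLM. No LLM is called here; the sketch stops at building the prompt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One extracted feature (hypothetical schema, not from the paper)."""
    start: float    # seconds since the conversation began
    modality: str   # "speech", "gaze", or "face" (assumed labels)
    speaker: str    # participant identifier, e.g. "A" or "B"
    content: str    # transcribed words or a textual description of the cue


def build_multimodal_transcript(events):
    """Merge per-modality events into a single time-ordered text transcript."""
    lines = []
    # sorted() is stable, so events with equal start times keep input order.
    for ev in sorted(events, key=lambda e: e.start):
        stamp = f"[{int(ev.start // 60):02d}:{int(ev.start % 60):02d}]"
        if ev.modality == "speech":
            lines.append(f'{stamp} {ev.speaker}: "{ev.content}"')
        else:
            # Non-verbal cues are written as parenthetical annotations.
            lines.append(f"{stamp} ({ev.speaker} {ev.modality}: {ev.content})")
    return "\n".join(lines)


def build_prompt(transcript, participant, question):
    """Wrap the transcript in an (assumed) questionnaire-style instruction."""
    return (
        "Below is a transcript of a two-person conversation. "
        "Non-verbal behavior appears in parentheses.\n\n"
        f"{transcript}\n\n"
        f"Answer as participant {participant} on a scale of 1 to 5: {question}\n"
    )


# Example input, deliberately out of order to show the time-based merge.
events = [
    Event(3.2, "speech", "B", "Not yet."),
    Event(0.0, "speech", "A", "Have you tried the new cafe?"),
    Event(3.4, "face", "B", "slight smile"),
    Event(1.5, "gaze", "B", "looking away from partner"),
]

transcript = build_multimodal_transcript(events)
prompt = build_prompt(transcript, "A", "How engaged were you in this conversation?")
print(prompt)
```

Representing every modality as text keeps the fusion step model-agnostic: the resulting prompt could be passed to any chat-style LLM, and the returned rating compared against the participant's self-reported engagement score.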

Multimodal Fusion with LLMs for Engagement Prediction in Natural Conversation
