M&M: Multimodal-Multitask Model Integrating Audiovisual Cues in Cognitive Load Assessment

Long Nguyen-Phuoc, Renald Gaboriau, Dimitri Delacroix, Laurent Navarro

DOI: https://doi.org/10.5220/0012575100003660

2024-03-14

Abstract: This paper introduces the M&M model, a novel multimodal-multitask learning framework, applied to the AVCAffe dataset for cognitive load assessment (CLA). M&M uniquely integrates audiovisual cues through a dual-pathway architecture, featuring specialized streams for audio and video inputs. A key innovation lies in its cross-modality multihead attention mechanism, fusing the different modalities for synchronized multitasking. Another notable feature is the model's three specialized branches, each tailored to a specific cognitive load label, enabling nuanced, task-specific analysis. While it shows modest performance compared to the AVCAffe's single-task baseline, M&M demonstrates a promising framework for integrated multimodal processing. This work paves the way for future enhancements in multimodal-multitask learning systems, emphasizing the fusion of diverse data types for complex task handling.

Computer Vision and Pattern Recognition, Multimedia, Sound, Audio and Speech Processing

What problem does this paper attempt to address?

The paper asks how to integrate multimodal data (audio and video) effectively in Cognitive Load Assessment (CLA), and whether a multitask learning framework can make the assessment more accurate and robust. To that end it proposes a new model, M&M (Multimodal-Multitask Model), built around three ideas:

1. **Multimodal Data Fusion**: M&M processes audio and video inputs separately through a dual-path architecture, then integrates the two modalities with a Cross-Modality Multihead Attention Mechanism so that cues from both contribute to the estimate of cognitive load.
2. **Multitask Learning**: M&M includes three specialized branches, each targeting a specific cognitive load label (such as mental demand, effort level, time demand). This lets the model perform task-specific analysis for each label, with the intent of improving overall accuracy and robustness.
3. **Compact and Efficient Model Design**: M&M aims to be a compact, efficient AI solution suitable for environments with limited computational resources, which also simplifies deployment in scenarios such as human-computer interaction.

Overall, M&M offers a new framework for cognitive load assessment that combines multimodal fusion with multitask learning. The abstract reports only modest performance relative to AVCAffe's single-task baseline, so the contribution is the integrated framework and its potential for complex tasks rather than a new state of the art.
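The data flow described above (two modality streams, cross-modality multihead attention, three task branches) can be illustrated with a minimal, untrained sketch. This is not the authors' implementation: the function names, the random weights, the mean pooling, the sigmoid outputs, the feature sizes, and the task names in `TASKS` are all assumptions chosen for illustration, and pure-Python lists stand in for the tensors a real model would use.

```python
import math
import random

# Hypothetical task names; the summary lists them only as examples.
TASKS = ("mental_demand", "effort", "time_demand")

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def rand_matrix(rows, cols, rng):
    """Random weights standing in for learned parameters (assumption)."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def cross_attention(queries, context, num_heads, rng):
    """Multi-head attention: queries from one modality, keys/values from the other."""
    dim = len(queries[0])
    head_dim = dim // num_heads
    heads = []
    for _ in range(num_heads):
        wq, wk, wv = (rand_matrix(dim, head_dim, rng) for _ in range(3))
        q, k, v = matmul(queries, wq), matmul(context, wk), matmul(context, wv)
        k_t = [list(col) for col in zip(*k)]          # transpose of k
        scores = matmul(q, k_t)                       # (query steps) x (context steps)
        weights = [softmax([s / math.sqrt(head_dim) for s in row]) for row in scores]
        heads.append(matmul(weights, v))
    # Concatenate the heads along the feature axis for every query step.
    return [sum((h[i] for h in heads), []) for i in range(len(queries))]

def mean_pool(seq):
    """Average a sequence of feature vectors over time."""
    return [sum(col) / len(seq) for col in zip(*seq)]

def m_and_m_sketch(audio, video, num_heads=2, seed=0):
    """Fuse two feature sequences and emit one score per task branch."""
    rng = random.Random(seed)
    audio_to_video = cross_attention(audio, video, num_heads, rng)
    video_to_audio = cross_attention(video, audio, num_heads, rng)
    fused = mean_pool(audio_to_video) + mean_pool(video_to_audio)  # list concatenation
    outputs = {}
    for task in TASKS:  # one small linear branch per cognitive load label
        weights = rand_matrix(len(fused), 1, rng)
        logit = matmul([fused], weights)[0][0]
        outputs[task] = 1.0 / (1.0 + math.exp(-logit))  # sigmoid score (assumption)
    return outputs

if __name__ == "__main__":
    rng = random.Random(1)
    audio = rand_matrix(4, 8, rng)  # 4 audio steps, 8 features each (made-up sizes)
    video = rand_matrix(3, 8, rng)  # 3 video steps, 8 features each
    print(m_and_m_sketch(audio, video))
```

The sketch runs attention in both directions so that each modality can query the other, and shares the fused vector across all three branches; whether the paper fuses in one or both directions is not stated in this summary.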

Learning Explicit and Implicit Latent Common Spaces for Audio-Visual Cross-Modal Retrieval

A Multimodal Saliency Model For Videos With High Audio-Visual Correspondence

A developmental model of audio-visual attention (MAVA) for bimodal language learning in infants and robots

What is the Visual Cognition Gap between Humans and Multimodal LLMs?

Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

Matryoshka Multimodal Models

CaMML: Context-Aware Multimodal Learner for Large Models

LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

Masked Co-Attention Model for Audio-Visual Event Localization

Multimodal Instruction Tuning with Hybrid State Space Models

m&m's: A Benchmark to Evaluate Tool-Use for multi-step multi-modal Tasks

AIM: Let Any Multi-modal Large Language Models Embrace Efficient In-Context Learning

Enhancing Instruction-Following Capability of Visual-Language Models by Reducing Image Redundancy

Answering Diverse Questions via Text Attached with Key Audio-Visual Clues

CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios

MAVEN: A Memory Augmented Recurrent Approach for Multimodal Fusion

Assessing Modality Bias in Video Question Answering Benchmarks with Multimodal Large Language Models

MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning

MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding

MM-BigBench: Evaluating Multimodal Models on Multimodal Content Comprehension Tasks