Abstract: Multimodal intent recognition poses significant challenges, requiring the incorporation of non-verbal modalities from real-world contexts to enhance the comprehension of human intentions. Existing benchmark datasets are limited in scale and suffer from difficulties in handling out-of-scope samples that arise in multi-turn conversational interactions. We introduce MIntRec2.0, a large-scale benchmark dataset for multimodal intent recognition in multi-party conversations. It contains 1,245 dialogues with 15,040 samples, each annotated within a new intent taxonomy of 30 fine-grained classes. Besides 9,304 in-scope samples, it also includes 5,736 out-of-scope samples appearing in multi-turn contexts, which naturally occur in real-world scenarios. Furthermore, we provide comprehensive information on the speakers in each utterance, enriching its utility for multi-party conversational research. We establish a general framework supporting the organization of single-turn and multi-turn dialogue data, modality feature extraction, multimodal fusion, as well as in-scope classification and out-of-scope detection. Evaluation benchmarks are built using classic multimodal fusion methods, ChatGPT, and human evaluators. While existing methods incorporating nonverbal information yield improvements, effectively leveraging context information and detecting out-of-scope samples remain substantial challenges. Notably, large language models exhibit a significant performance gap compared to humans, highlighting the limitations of machine learning methods in the cognitive intent understanding task. We believe that MIntRec2.0 will serve as a valuable resource, providing a pioneering foundation for research in human-machine conversational interactions, and significantly facilitating related applications. The full dataset and code are available at https://github.com/thuiar/MIntRec2.0.

Vid2Int: Detecting Implicit Intention from Long Dialog Videos

Research on Implicit Intent Recognition Method Based on Prompt Learning

A Prompt Learning Based Intent Recognition Method on a Chinese Implicit Intent Dataset CIID

Similarity Learning with Implicit-Network and Explicit-Network for Zero-Shot Intent Detection

VIOLIN: A Large-Scale Dataset for Video-and-Language Inference

VideoINSTA: Zero-shot Long Video Understanding via Informative Spatial-Temporal Reasoning with LLMs

InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding

InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation

Nonverbal Interaction Detection

VideoDistill: Language-aware Vision Distillation for Video Question Answering

Interaction-Integrated Network for Natural Language Moment Localization

Constructing Robust Emotional State-based Feature with a Novel Voting Scheme for Multi-modal Deception Detection in Videos

Uncovering the Unseen: Discover Hidden Intentions by Micro-Behavior Graph Reasoning

SDIF-DA: A Shallow-to-Deep Interaction Framework with Data Augmentation for Multi-modal Intent Detection

EALD-MLLM: Emotion Analysis in Long-sequential and De-identity videos with Multi-modal Large Language Model

MIntRec2.0: A Large-scale Benchmark Dataset for Multimodal Intent Recognition and Out-of-scope Detection in Conversations

InternVideo2: Scaling Foundation Models for Multimodal Video Understanding

In Your Eyes: Modality Disentangling for Personality Analysis in Short Video

Detail-Enhanced Intra- and Inter-modal Interaction for Audio-Visual Emotion Recognition

Intent3D: 3D Object Detection in RGB-D Scans Based on Human Intention