MotionLLM: Understanding Human Behaviors from Human Motions and Videos

Ling-Hao Chen, Shunlin Lu, Ailing Zeng, Hao Zhang, Benyou Wang, Ruimao Zhang, Lei Zhang

2024-05-31

Abstract: This study delves into the realm of multi-modality (i.e., video and motion modalities) human behavior understanding by leveraging the powerful capabilities of Large Language Models (LLMs). Diverging from recent LLMs designed for video-only or motion-only understanding, we argue that understanding human behavior necessitates joint modeling from both videos and motion sequences (e.g., SMPL sequences) to capture nuanced body part dynamics and semantics effectively. In light of this, we present MotionLLM, a straightforward yet effective framework for human motion understanding, captioning, and reasoning. Specifically, MotionLLM adopts a unified video-motion training strategy that leverages the complementary advantages of existing coarse video-text data and fine-grained motion-text data to glean rich spatial-temporal insights. Furthermore, we collect a substantial dataset, MoVid, comprising diverse videos, motions, captions, and instructions. Additionally, we propose MoVid-Bench, with careful manual annotations, for better evaluation of human behavior understanding on video and motion. Extensive experiments show the superiority of MotionLLM in captioning, spatial-temporal comprehension, and reasoning ability.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

This paper addresses human behavior understanding, specifically multimodal understanding from both videos and motion sequences (such as SMPL sequences). Most current research treats video understanding and motion understanding separately. The paper proposes MotionLLM, a framework that models video and motion data jointly to capture body part dynamics and semantics more effectively. MotionLLM adopts a unified video-motion training strategy that leverages the complementary strengths of existing coarse video-text data and fine-grained motion-text data to gain spatial-temporal insights. The authors also construct a large-scale dataset, MoVid, which includes diverse videos, motions, captions, and instructions to support different tasks and training stages, and a benchmark, MoVid-Bench, for evaluating fine-grained video and motion understanding, covering sequential dynamics, body part semantics, direction awareness, and reasoning. According to the reported experiments, MotionLLM outperforms existing methods in video and motion understanding, with average improvements of 38% and 15%. MotionLLM can also be applied to downstream tasks such as intelligent fitness coaching, which may be especially suitable for the visually impaired community. In short, the paper targets deep understanding of human behavior from video and motion data, and contributes a model, a dataset, and a benchmark toward more accurate, robust, and context-rich understanding of dynamics and semantics.
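To make the benchmark description more concrete, here is a minimal Python sketch of how per-category accuracy and its unweighted average could be computed over the four capability categories named above. The record layout `(category, is_correct)` and both function names are hypothetical, chosen only for illustration; the paper's actual evaluation protocol for MoVid-Bench may differ.

```python
from collections import defaultdict

# Category names are taken from the summary above.
# The (category, is_correct) record layout is an assumption for this sketch.
CATEGORIES = (
    "sequential dynamics",
    "body part semantics",
    "direction awareness",
    "reasoning",
)


def per_category_accuracy(records):
    """Return {category: accuracy} for an iterable of (category, is_correct)."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for category, is_correct in records:
        totals[category] += 1
        hits[category] += int(bool(is_correct))
    return {category: hits[category] / totals[category] for category in totals}


def macro_average(accuracies):
    """Unweighted mean of the per-category accuracies."""
    if not accuracies:
        raise ValueError("no categories to average")
    return sum(accuracies.values()) / len(accuracies)


if __name__ == "__main__":
    # Made-up answers, used only to exercise the functions.
    demo = [
        ("reasoning", True),
        ("reasoning", False),
        ("direction awareness", True),
    ]
    scores = per_category_accuracy(demo)
    print(scores, macro_average(scores))
```

Averaging per category rather than per question keeps a category with many questions from dominating the overall score; whether the reported averages are computed this way is not stated in the summary.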
