MotionLLM: Understanding Human Behaviors from Human Motions and Videos

Ling-Hao Chen, Shunlin Lu, Ailing Zeng, Hao Zhang, Benyou Wang, Ruimao Zhang, Lei Zhang
2024-05-31
Abstract: This study delves into the realm of multi-modality (i.e., video and motion modalities) human behavior understanding by leveraging the powerful capabilities of Large Language Models (LLMs). Diverging from recent LLMs designed for video-only or motion-only understanding, we argue that understanding human behavior necessitates joint modeling from both videos and motion sequences (e.g., SMPL sequences) to capture nuanced body part dynamics and semantics effectively. In light of this, we present MotionLLM, a straightforward yet effective framework for human motion understanding, captioning, and reasoning. Specifically, MotionLLM adopts a unified video-motion training strategy that leverages the complementary advantages of existing coarse video-text data and fine-grained motion-text data to glean rich spatial-temporal insights. Furthermore, we collect a substantial dataset, MoVid, comprising diverse videos, motions, captions, and instructions. Additionally, we propose MoVid-Bench, with careful manual annotations, for better evaluation of human behavior understanding on video and motion. Extensive experiments show the superiority of MotionLLM in captioning, spatial-temporal comprehension, and reasoning ability.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses human behavior understanding, specifically multimodal understanding from both videos and motion sequences (such as SMPL sequences). Existing work mostly treats video understanding and motion understanding separately; the paper instead proposes MotionLLM, a framework that models video and motion data jointly to capture body part dynamics and semantics more effectively. MotionLLM adopts a unified video-motion training strategy that exploits the complementary strengths of existing coarse video-text data and fine-grained motion-text data to gain spatial-temporal insights. The authors construct a large-scale dataset, MoVid, which includes diverse videos, motions, captions, and instructions to support different tasks and training stages. They also create the MoVid-Bench benchmark to evaluate fine-grained video and motion understanding, covering sequential dynamics, body part semantics, direction awareness, and reasoning ability. The experimental results show that MotionLLM outperforms existing methods in video and motion understanding, with average improvements of 38% and 15%. MotionLLM can also be applied to downstream tasks such as intelligent fitness coaching, which is especially relevant for the visually impaired community. In short, the paper targets deep understanding of human behavior from video and motion data, and contributes a new model, dataset, and benchmark aimed at more accurate, robust, and context-rich understanding of dynamics and semantics.