IDA-VLM: Towards Movie Understanding via ID-Aware Large Vision-Language Model

Yatai Ji, Shilong Zhang, Jie Wu, Peize Sun, Weifeng Chen, Xuefeng Xiao, Sidi Yang, Yujiu Yang, Ping Luo

2024-07-10

Abstract: The rapid advancement of Large Vision-Language models (LVLMs) has demonstrated a spectrum of emergent capabilities. Nevertheless, current models only focus on the visual content of a single scenario, while their ability to associate instances across different scenes has not yet been explored, which is essential for understanding complex visual content, such as movies with multiple characters and intricate plots. Towards movie understanding, a critical initial step for LVLMs is to unleash the potential of character identities memory and recognition across multiple visual scenarios. To achieve the goal, we propose visual instruction tuning with ID reference and develop an ID-Aware Large Vision-Language Model, IDA-VLM. Furthermore, our research introduces a novel benchmark MM-ID, to examine LVLMs on instance IDs memory and recognition across four dimensions: matching, location, question-answering, and captioning. Our findings highlight the limitations of existing LVLMs in recognizing and associating instance identities with ID reference. This paper paves the way for future artificial intelligence systems to possess multi-identity visual inputs, thereby facilitating the comprehension of complex visual narratives like movies.

Computer Vision and Pattern Recognition, Artificial Intelligence

What problem does this paper attempt to address?

This paper addresses the inability of existing large vision-language models (LVLMs) to associate instances across scenes in complex, multi-identity visual understanding tasks. Current LVLMs mainly process visual content within a single scene and fail to remember and recognize the same instance, such as a movie character, when it appears in different scenes. This ability is crucial for understanding visual content with multiple characters and intricate plots, such as movies. The paper therefore proposes visual instruction tuning with ID reference and develops a model named IDA-VLM, which aims to improve how LVLMs remember and recognize instance identities across scenes. The main contributions of the paper are:

1. **First exploration**: This is the first attempt to study the ID-aware ability of LVLMs when processing complex multi-identity visual content such as movies.
2. **Model and dataset**: The paper proposes visual instruction tuning with ID reference and constructs a corresponding tuning dataset, so that the model learns to recognize and remember identity information across scenes.
3. **Benchmark**: The paper introduces a new benchmark, MM-ID, for evaluating LVLMs on identity memory and recognition. MM-ID covers four increasingly complex tasks: matching, location, question-answering, and captioning.

Through these contributions, the paper both exposes the limitations of existing LVLMs in cross-scene identity recognition and proposes a way to address them, opening a direction for future artificial intelligence systems.
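To make the idea of an "ID reference" concrete, the following is a minimal sketch of how a multi-identity query might be laid out: each reference image is paired with a character name, followed by a test image and a question from one of the four MM-ID task types. The class names, the `<image:...>` placeholder syntax, and the prompt layout are all assumptions made for illustration; they are not the format used by IDA-VLM or MM-ID.

```python
from dataclasses import dataclass
from enum import Enum


class Task(Enum):
    # The four MM-ID dimensions named in the abstract.
    MATCHING = "matching"
    LOCATION = "location"
    QA = "question-answering"
    CAPTION = "captioning"


@dataclass
class IDReference:
    name: str        # identity label, e.g. a character name
    image_path: str  # hypothetical path to a reference image of that character


def build_prompt(refs, test_image, task, question):
    """Interleave ID reference images with names, then append the test image and question.

    The "<image:...>" placeholder is an illustrative convention only.
    """
    parts = [f"<image:{r.image_path}> is {r.name}." for r in refs]
    parts.append(f"<image:{test_image}>")
    parts.append(f"[{task.value}] {question}")
    return " ".join(parts)


if __name__ == "__main__":
    refs = [IDReference("Alice", "alice.jpg"), IDReference("Bob", "bob.jpg")]
    print(build_prompt(refs, "scene.jpg", Task.MATCHING, "Who is in the image?"))
```

Keeping the references as explicit name-image pairs, separate from the test image, mirrors the distinction the summary draws between remembering identities and recognizing them in a new scene.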

Beyond Sight: Towards Cognitive Alignment in LVLM via Enriched Visual Knowledge

RelationVLM: Making Large Vision-Language Models Understand Visual Relations

LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models

MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models

Vision-Language Intelligence: Tasks, Representation Learning, and Large Models

An Introduction to Vision-Language Modeling

Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement

MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models

Tackling Vision Language Tasks Through Learning Inner Monologues

VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks

Understanding Long Videos with Multimodal Language Models

Valley: Video Assistant with Large Language model Enhanced abilitY

Visual In-Context Learning for Large Vision-Language Models

Audio-Visual LLM for Video Understanding

InfMLLM: A Unified Framework for Visual-Language Tasks

HumanVLM: Foundation for Human-Scene Vision-Language Model

VIALM: A Survey and Benchmark of Visually Impaired Assistance with Large Models

A-VL: Adaptive Attention for Large Vision-Language Models

Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model

Automated Evaluation of Large Vision-Language Models on Self-driving Corner Cases