Abstract: With the burgeoning growth of online video platforms and the escalating volume of video content, the demand for proficient video understanding tools has intensified markedly. Given the remarkable capabilities of large language models (LLMs) in language and multimodal tasks, this survey provides a detailed overview of recent advancements in video understanding that harness the power of LLMs (Vid-LLMs). The emergent capabilities of Vid-LLMs are surprisingly advanced, particularly their ability for open-ended multi-granularity (general, temporal, and spatiotemporal) reasoning combined with commonsense knowledge, suggesting a promising path for future video understanding. We examine the unique characteristics and capabilities of Vid-LLMs, categorizing the approaches into three main types: Video Analyzer x LLM, Video Embedder x LLM, and (Analyzer + Embedder) x LLM. We also identify five sub-types based on the functions of LLMs in Vid-LLMs: LLM as Summarizer, LLM as Manager, LLM as Text Decoder, LLM as Regressor, and LLM as Hidden Layer. Furthermore, this survey presents a comprehensive study of the tasks, datasets, benchmarks, and evaluation methodologies for Vid-LLMs. Additionally, it explores the expansive applications of Vid-LLMs across various domains, highlighting their remarkable scalability and versatility in real-world video understanding challenges. Finally, it summarizes the limitations of existing Vid-LLMs and outlines directions for future research. For more information, readers are encouraged to visit the repository at https://github.com/yunlong10/Awesome-LLMs-for-Video-Understanding.

What problem does this paper attempt to address?

The paper surveys how large language models (LLMs) can be used to enhance video understanding. Specifically, it addresses four points:

1. **Enhancement of Video Understanding Capabilities**: With the rapid development of online video platforms and the surge in video content, the demand for efficient video understanding tools is increasing. The paper reviews methods that achieve higher-level understanding and reasoning in video analysis tasks by integrating large language models.
2. **Fusion of Multimodal Understanding**: Large language models are introduced into video understanding because of their strong performance in language and multimodal tasks. This improves a model's ability to reason at general, temporal, and spatiotemporal granularity, and to perform open-ended, multi-granularity reasoning that draws on commonsense knowledge.
3. **Classification and Method Review**: The paper categorizes LLM-based video understanding methods (Vid-LLMs) into three major types: Video Analyzer x LLM, Video Embedder x LLM, and (Analyzer + Embedder) x LLM. It further identifies five sub-types by the function of the LLM: LLM as Summarizer, LLM as Manager, LLM as Text Decoder, LLM as Regressor, and LLM as Hidden Layer.
4. **Applications and Future Directions**: In addition to summarizing existing methods, the paper discusses the applications of Vid-LLMs across different domains. It also points out the current limitations of Vid-LLMs and outlines future research directions.

In summary, the paper aims to fill a gap in the existing literature: a review of video understanding based on large language models. It systematically introduces the latest developments in this field and their potential applications.
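
The taxonomy described above (three main types, five LLM roles) can be written down as a small data structure for tagging models while reading the survey. The following is a minimal sketch, not something taken from the paper: the class names, the `group_by_framework` helper, and the example entries are all hypothetical, and the type/role pairings in the examples are arbitrary illustrations rather than the survey's actual assignments.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Framework(Enum):
    """The three main Vid-LLM types named in the abstract."""
    ANALYZER_X_LLM = "Video Analyzer x LLM"
    EMBEDDER_X_LLM = "Video Embedder x LLM"
    ANALYZER_EMBEDDER_X_LLM = "(Analyzer + Embedder) x LLM"


class LLMRole(Enum):
    """The five sub-types, named by the function of the LLM."""
    SUMMARIZER = "LLM as Summarizer"
    MANAGER = "LLM as Manager"
    TEXT_DECODER = "LLM as Text Decoder"
    REGRESSOR = "LLM as Regressor"
    HIDDEN_LAYER = "LLM as Hidden Layer"


@dataclass(frozen=True)
class VidLLMEntry:
    """One tagged model. The fields are an assumption for illustration."""
    name: str
    framework: Framework
    role: LLMRole


def group_by_framework(entries):
    """Return a dict mapping each Framework to the model names tagged with it."""
    grouped = defaultdict(list)
    for entry in entries:
        grouped[entry.framework].append(entry.name)
    return dict(grouped)


# Hypothetical entries; the pairings are NOT claims about any real model.
examples = [
    VidLLMEntry("HypotheticalModelA", Framework.ANALYZER_X_LLM, LLMRole.SUMMARIZER),
    VidLLMEntry("HypotheticalModelB", Framework.EMBEDDER_X_LLM, LLMRole.TEXT_DECODER),
    VidLLMEntry("HypotheticalModelC", Framework.EMBEDDER_X_LLM, LLMRole.REGRESSOR),
]

grouped = group_by_framework(examples)
for framework, names in grouped.items():
    print(f"{framework.value}: {', '.join(names)}")
```

Keeping the framework and the LLM role as two separate enums mirrors how the abstract presents them: as two axes of classification, without committing to which roles occur under which framework.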

Video Understanding with Large Language Models: A Survey

VideoLLM: Modeling Video Sequence with Large Language Models

From Seconds to Hours: Reviewing MultiModal Large Language Models on Comprehensive Long Video Understanding

LongVLM: Efficient Long Video Understanding via Large Language Models

Understanding Long Videos with Multimodal Language Models

Streaming Long Video Understanding with Large Language Models

VideoLLM-online: Online Video Large Language Model for Streaming Video

MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding

Video-Bench: A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models

Audio-Visual LLM for Video Understanding

VideoQA in the Era of LLMs: An Empirical Study

VLM-Eval: A General Evaluation on Video Large Language Models

A Comprehensive Survey and Guide to Multimodal Large Language Models in Vision-Language Tasks

Efficient Multimodal Large Language Models: A Survey

LLMs Meet Long Video: Advancing Long Video Question Answering with An Interactive Visual Adapter in LLMs

A Survey on Evaluation of Large Language Models

LinVT: Empower Your Image-level Large Language Model to Understand Videos

Survey of different Large Language Model Architectures: Trends, Benchmarks, and Challenges

Evaluating Large Language Models: A Comprehensive Survey

ST-LLM: Large Language Models Are Effective Temporal Learners