Abstract: Current language models fall short in understanding aspects of the world not easily described in words, and struggle with complex, long-form tasks. Video sequences offer valuable temporal information absent in language and static images, making them attractive for joint modeling with language. Such models could develop an understanding of both human textual knowledge and the physical world, enabling broader AI capabilities for assisting humans. However, learning from millions of tokens of video and language sequences poses challenges due to memory constraints, computational complexity, and limited datasets. To address these challenges, we curate a large dataset of diverse videos and books, utilize the Blockwise RingAttention technique to scalably train on long sequences, and gradually increase context size from 4K to 1M tokens. This paper makes the following contributions: (a) Largest context size neural network: We train one of the largest context size transformers on long video and language sequences, setting new benchmarks in difficult retrieval tasks and long video understanding. (b) Solutions for overcoming vision-language training challenges, including using masked sequence packing for mixing different sequence lengths, loss weighting to balance language and vision, and a model-generated QA dataset for long sequence chat. (c) A highly optimized implementation with RingAttention, Blockwise Transformers, masked sequence packing, and other key features for training on millions-length multimodal sequences. (d) A fully open-sourced family of 7B parameter models capable of processing long text documents (LWM-Text, LWM-Text-Chat) and videos (LWM, LWM-Chat) of over 1M tokens. This work paves the way for training on massive datasets of long video and language to develop understanding of both human knowledge and the multimodal world, and broader capabilities.
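The abstract names Blockwise RingAttention without describing it. As a rough illustration only, the sketch below shows the blockwise part of the idea for a single query vector: keys and values are visited one block at a time while a running maximum and normalizer are maintained, so the full score vector is never materialized and the result still matches ordinary softmax attention. The function names and the plain-Python list representation are assumptions made for this example, not the paper's implementation; in RingAttention the key/value blocks would additionally be passed between devices in a ring, which the simple loop here only stands in for.

```python
import math


def _scaled_dot(q, k):
    # Scaled dot-product score between one query and one key vector.
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))


def full_attention(q, keys, values):
    # Reference: softmax attention computed over all keys at once.
    scores = [_scaled_dot(q, k) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def blockwise_attention(q, keys, values, block_size):
    # Hypothetical helper: same result as full_attention, but keys/values
    # are consumed block by block with a running max and normalizer.
    dim = len(values[0])
    running_max = -math.inf
    denom = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [_scaled_dot(q, k) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated so far to the new maximum.
        # On the first block this factor is exp(-inf) == 0.0.
        rescale = math.exp(running_max - new_max)
        weights = [math.exp(s - new_max) for s in scores]
        denom = denom * rescale + sum(weights)
        acc = [a * rescale + sum(w * v[d] for w, v in zip(weights, v_blk))
               for d, a in enumerate(acc)]
        running_max = new_max
    return [a / denom for a in acc]
```

Because only one block of keys and values is needed at a time, peak memory in this sketch depends on the block size rather than on the full sequence length, which is the property that makes very long contexts tractable.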

What problem does this paper attempt to address?

The paper addresses the limits of current language models on complex, long-form tasks, particularly those involving aspects of the world that are hard to describe in words. Video sequences carry temporal information that is absent from language and static images, which makes them attractive for joint modeling with language. However, learning from millions of tokens of video and language sequences is difficult because of memory constraints, computational complexity, and limited datasets. To address this, the authors curate a large dataset of diverse videos and books, use Blockwise RingAttention to train on long sequences at scale, and gradually increase the context size from 4K to 1M tokens. The main contributions of the paper are:

1. **One of the Largest Context Sizes**: The authors train one of the largest context size transformers on long video and language sequences, setting new benchmarks in difficult retrieval tasks and long video understanding.
2. **Solutions to Vision-Language Training Challenges**: Masked sequence packing to mix sequences of different lengths, loss weighting to balance language and vision, and a model-generated question-answer dataset for long-sequence chat.
3. **Highly Optimized Implementation**: RingAttention, Blockwise Transformers, masked sequence packing, and other key features for training on multimodal sequences millions of tokens long.
4. **Fully Open-Sourced 7B Parameter Model Family**: Models that process long text documents (LWM-Text, LWM-Text-Chat) and videos (LWM, LWM-Chat) of over 1M tokens.

Through these contributions, the work paves the way for training on massive datasets of long video and language, supporting AI systems that understand both human knowledge and the multimodal world.
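The masked sequence packing and loss weighting mentioned above can be illustrated with a small sketch. It packs several examples of different lengths into one fixed-length sequence, builds a causal attention mask that keeps each example from attending to its neighbors, and assigns per-token loss weights. All function names are hypothetical, and the weighting rule (each example contributes equally, padding contributes nothing) is an assumption chosen for illustration; the summary does not specify the exact weighting the authors use.

```python
def pack_sequences(examples, max_len, pad_id=0):
    # Concatenate examples into one sequence and record which example
    # (segment) each token came from. Segment id 0 marks padding.
    tokens, segment_ids = [], []
    for seg, example in enumerate(examples, start=1):
        tokens.extend(example)
        segment_ids.extend([seg] * len(example))
    if len(tokens) > max_len:
        raise ValueError("examples do not fit into max_len")
    pad = max_len - len(tokens)
    tokens.extend([pad_id] * pad)
    segment_ids.extend([0] * pad)
    return tokens, segment_ids


def packed_causal_mask(segment_ids):
    # mask[i][j] is True when token i may attend to token j: both tokens
    # belong to the same non-padding segment and j is not in the future.
    n = len(segment_ids)
    return [[segment_ids[i] != 0
             and segment_ids[i] == segment_ids[j]
             and j <= i
             for j in range(n)]
            for i in range(n)]


def per_token_loss_weights(segment_ids):
    # Assumed weighting: tokens of an example share a total weight of 1,
    # so short and long examples count equally; padding gets weight 0.
    counts = {}
    for seg in segment_ids:
        if seg != 0:
            counts[seg] = counts.get(seg, 0) + 1
    return [1.0 / counts[seg] if seg != 0 else 0.0 for seg in segment_ids]


if __name__ == "__main__":
    toks, segs = pack_sequences([[5, 6, 7], [8, 9]], max_len=6)
    print(toks, segs)
    print(per_token_loss_weights(segs))
```

Without the segment check in the mask, tokens of the second example could attend to the first one, so packed training would no longer match training each example on its own.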

World Model on Million-Length Video And Language With Blockwise RingAttention

Long Context Transfer from Language to Vision

Ring Attention with Blockwise Transformers for Near-Infinite Context

Streaming Long Video Understanding with Large Language Models

LongVILA: Scaling Long-Context Visual Language Models for Long Videos

Visual Context Window Extension: A New Perspective for Long Video Understanding

Understanding Long Videos with Multimodal Language Models

VideoLLaMB: Long-context Video Understanding with Recurrent Memory Bridges

Video-CCAM: Enhancing Video-Language Understanding with Causal Cross-Attention Masks for Short and Long Videos

VideoLLM: Modeling Video Sequence with Large Language Models

Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding

MovieChat: From Dense Token to Sparse Memory for Long Video Understanding

Kangaroo: A Powerful Video-Language Model Supporting Long-context Video Input

Blockwise Parallel Transformer for Large Context Models

Towards Long-Form Video Understanding

MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding

Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization

MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos

LongVLM: Efficient Long Video Understanding via Large Language Models

Efficient Temporal Extrapolation of Multimodal Large Language Models with Temporal Grounding Bridge

WorldQA: Multimodal World Knowledge in Videos through Long-Chain Reasoning