InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions

Pan Zhang, Xiaoyi Dong, Yuhang Cao, Yuhang Zang, Rui Qian, Xilin Wei, Lin Chen, Yifei Li, Junbo Niu, Shuangrui Ding, Qipeng Guo, Haodong Duan, Xin Chen, Han Lv, Zheng Nie, Min Zhang, Bin Wang, Wenwei Zhang, Xinyue Zhang, Jiaye Ge, Wei Li, Jingwen Li, Zhongying Tu, Conghui He, Xingcheng Zhang, Kai Chen, Yu Qiao, Dahua Lin, Jiaqi Wang
2024-12-13
Abstract: Creating AI systems that can interact with environments over long periods, similar to human cognition, has been a longstanding research goal. Recent advancements in multimodal large language models (MLLMs) have made significant strides in open-world understanding. However, the challenge of continuous and simultaneous streaming perception, memory, and reasoning remains largely unexplored. Current MLLMs are constrained by their sequence-to-sequence architecture, which limits their ability to process inputs and generate responses simultaneously, akin to being unable to think while perceiving. Furthermore, relying on long contexts to store historical data is impractical for long-term interactions, as retaining all information becomes costly and inefficient. Therefore, rather than relying on a single foundation model to perform all functions, this project draws inspiration from the concept of the Specialized Generalist AI and introduces disentangled streaming perception, reasoning, and memory mechanisms, enabling real-time interaction with streaming video and audio input. The proposed framework InternLM-XComposer2.5-OmniLive (IXC2.5-OL) consists of three key modules: (1) Streaming Perception Module: Processes multimodal information in real-time, storing key details in memory and triggering reasoning in response to user queries. (2) Multi-modal Long Memory Module: Integrates short-term and long-term memory, compressing short-term memories into long-term ones for efficient retrieval and improved accuracy. (3) Reasoning Module: Responds to queries and executes reasoning tasks, coordinating with the perception and memory modules. This project simulates human-like cognition, enabling multimodal large language models to provide continuous and adaptive service over time.
Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language
### What problems does this paper attempt to solve?

This paper aims to address the limitations of current multi-modal large language models (MLLMs) in long-term continuous interaction. Specifically, existing models face the following challenges when dealing with real-time video and audio streams:

1. **Limitations of Alternating Perception and Reasoning**:
   - Current MLLMs adopt a sequence-to-sequence architecture, which makes them unable to perform input processing and response generation simultaneously, similar to "being unable to think while perceiving". This architecture restricts the model's performance in continuous interaction.
2. **Problems in Long-term Memory Management**:
   - Existing models rely on long-context to store historical data, but this is impractical in long-term interaction because retaining all information is both costly and inefficient. Especially for scenarios that require continuous AI assistance, such as several days of service time, the amount of accumulated multi-modal data will rapidly increase to millions of tokens, leading to storage cost and efficiency issues.

To solve these problems, the paper proposes a system named **InternLM-XComposer2.5-OmniLive (IXC2.5-OL)**. By introducing separate streaming perception, reasoning, and memory mechanisms, this system achieves efficient processing of real-time video and audio inputs and can simulate human cognitive processes to provide continuous and highly adaptable services.

### Specific Solutions

The IXC2.5-OL system consists of three key modules:

1. **Streaming Perception Module**:
   - Processes multi-modal information in real-time, stores key details in memory, and triggers the reasoning process when the user queries.
   - It includes a video perception module and an audio translation module, which process video and audio streams respectively.
2. **Multi-modal Long Memory Module**:
   - Integrates short-term and long-term memory, compresses short-term memory into long-term memory to improve retrieval efficiency and accuracy.
   - Continuously compresses short-term memory to make long-term memory more information-rich and convenient for rapid retrieval.
3. **Reasoning Module**:
   - Responds to queries and performs reasoning tasks based on the information provided by the perception and memory modules.
   - It is the core cognitive component of the system and is responsible for handling complex reasoning tasks.

Through the design of these modules, IXC2.5-OL overcomes the limitations of existing MLLMs in continuous perception and reasoning, achieving more natural and long-lasting human-machine interaction.

### Summary

The main objective of this paper is to develop an AI system that can continuously interact with the environment over a long period, simulating human cognitive abilities. By introducing separate perception, memory, and reasoning modules, IXC2.5-OL addresses the deficiencies of existing models in real-time processing and long-term memory management, providing a more efficient and natural multi-modal interaction experience.
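
### Illustrative Sketch

The separation of perception, memory, and reasoning described above can be illustrated with a small toy program. This is only a sketch under stated assumptions, not the paper's implementation: the names `LongMemory`, `perceive`, and `reason` are hypothetical, features are plain Python lists rather than learned embeddings, mean pooling stands in for whatever compression the system actually uses, and cosine similarity stands in for its retrieval. The sketch also runs sequentially for simplicity, whereas the system is described as running perception and reasoning simultaneously.

```python
import math
from collections import deque
from dataclasses import dataclass, field


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class LongMemory:
    """Toy memory: a short-term buffer that is compressed into long-term entries."""
    short_capacity: int = 4
    short_term: deque = field(default_factory=deque)
    long_term: list = field(default_factory=list)

    def store(self, feature):
        """Append one feature; compress once the short-term buffer is full."""
        self.short_term.append(list(feature))
        if len(self.short_term) >= self.short_capacity:
            self._compress()

    def _compress(self):
        # Assumption: mean pooling is a stand-in for the real compression step.
        items = list(self.short_term)
        dim = len(items[0])
        mean = [sum(v[i] for v in items) / len(items) for i in range(dim)]
        self.long_term.append(mean)
        self.short_term.clear()

    def retrieve(self, query, k=1):
        """Return the k long-term entries most similar to the query vector."""
        ranked = sorted(self.long_term, key=lambda m: cosine(m, query), reverse=True)
        return ranked[:k]


def perceive(stream, memory):
    """Perception: store each frame feature and pass user queries onward."""
    for event in stream:
        if event["type"] == "frame":
            memory.store(event["feature"])
        elif event["type"] == "query":
            yield event  # a query triggers the reasoning step


def reason(query_event, memory):
    """Reasoning: placeholder that pairs the question with retrieved memory."""
    hits = memory.retrieve(query_event["feature"], k=1)
    return {"question": query_event["text"], "evidence": hits}


# Hypothetical stream: four frames followed by one user query.
memory = LongMemory(short_capacity=2)
stream = [
    {"type": "frame", "feature": [1.0, 0.0]},
    {"type": "frame", "feature": [1.0, 0.0]},
    {"type": "frame", "feature": [0.0, 1.0]},
    {"type": "frame", "feature": [0.0, 1.0]},
    {"type": "query", "text": "What happened first?", "feature": [1.0, 0.1]},
]
answers = [reason(q, memory) for q in perceive(stream, memory)]
print(answers)
```

The point of the design is that the long-term store grows with the number of compressed segments rather than with the number of raw frames, so the reasoning step only ever sees a few retrieved entries instead of the full history.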