Abstract: Creating AI systems that can interact with environments over long periods, similar to human cognition, has been a longstanding research goal. Recent advancements in multimodal large language models (MLLMs) have made significant strides in open-world understanding. However, the challenge of continuous and simultaneous streaming perception, memory, and reasoning remains largely unexplored. Current MLLMs are constrained by their sequence-to-sequence architecture, which limits their ability to process inputs and generate responses simultaneously, akin to being unable to think while perceiving. Furthermore, relying on long contexts to store historical data is impractical for long-term interactions, as retaining all information becomes costly and inefficient. Therefore, rather than relying on a single foundation model to perform all functions, this project draws inspiration from the concept of the Specialized Generalist AI and introduces disentangled streaming perception, reasoning, and memory mechanisms, enabling real-time interaction with streaming video and audio input. The proposed framework InternLM-XComposer2.5-OmniLive (IXC2.5-OL) consists of three key modules: (1) Streaming Perception Module: Processes multimodal information in real-time, storing key details in memory and triggering reasoning in response to user queries. (2) Multi-modal Long Memory Module: Integrates short-term and long-term memory, compressing short-term memories into long-term ones for efficient retrieval and improved accuracy. (3) Reasoning Module: Responds to queries and executes reasoning tasks, coordinating with the perception and memory modules. This project simulates human-like cognition, enabling multimodal large language models to provide continuous and adaptive service over time.
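The cost argument in the abstract can be made concrete with a back-of-the-envelope estimate. The sketch below counts how many tokens pile up if every sampled video frame is kept in context. The function name, the sampling rate of 1 frame per second, and the 64 tokens per frame are illustrative assumptions, not figures from the paper.

```python
def accumulated_tokens(hours: float,
                       frames_per_second: float = 1.0,
                       tokens_per_frame: int = 64) -> int:
    """Tokens accumulated if every sampled frame stays in the context window.

    frames_per_second and tokens_per_frame are assumed values for illustration.
    """
    seconds = hours * 3600
    return int(seconds * frames_per_second * tokens_per_frame)


# Print the estimate for one hour, one day, and three days of streaming.
for hours in (1, 24, 72):
    print(f"{hours:>3} h -> {accumulated_tokens(hours):,} tokens")
```

Under these assumptions the count grows linearly with time and passes a million tokens within the first day, which is why retaining everything in a long context does not scale to multi-day interaction.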

### What problems does this paper attempt to solve?

This paper aims to address the limitations of current multi-modal large language models (MLLMs) in long-term continuous interaction. Specifically, existing models face the following challenges when dealing with real-time video and audio streams:

1. **Limitations of Alternating Perception and Reasoning**:
   - Current MLLMs adopt a sequence-to-sequence architecture, which makes them unable to perform input processing and response generation simultaneously, similar to "being unable to think while perceiving". This architecture restricts the model's performance in continuous interaction.
2. **Problems in Long-term Memory Management**:
   - Existing models rely on long context to store historical data, but this is impractical in long-term interaction because retaining all information is both costly and inefficient. Especially for scenarios that require continuous AI assistance, such as several days of service time, the amount of accumulated multi-modal data will rapidly increase to millions of tokens, leading to storage cost and efficiency issues.

To solve these problems, the paper proposes a system named **InternLM-XComposer2.5-OmniLive (IXC2.5-OL)**. By introducing separate streaming perception, reasoning, and memory mechanisms, this system achieves efficient processing of real-time video and audio inputs and can simulate human cognitive processes to provide continuous and highly adaptable services.

### Specific Solutions

The IXC2.5-OL system consists of three key modules:

1. **Streaming Perception Module**:
   - Processes multi-modal information in real-time, stores key details in memory, and triggers the reasoning process when the user queries.
   - It includes a video perception module and an audio translation module, which process video and audio streams respectively.
2. **Multi-modal Long Memory Module**:
   - Integrates short-term and long-term memory, compresses short-term memory into long-term memory to improve retrieval efficiency and accuracy.
   - Continuously compresses short-term memory to make long-term memory more information-rich and convenient for rapid retrieval.
3. **Reasoning Module**:
   - Responds to queries and performs reasoning tasks based on the information provided by the perception and memory modules.
   - It is the core cognitive component of the system and is responsible for handling complex reasoning tasks.

Through the design of these modules, IXC2.5-OL overcomes the limitations of existing MLLMs in continuous perception and reasoning, achieving more natural and long-lasting human-machine interaction.

### Summary

The main objective of this paper is to develop an AI system that can continuously interact with the environment over a long period, simulating human cognitive abilities. By introducing separate perception, memory, and reasoning modules, IXC2.5-OL addresses the deficiencies of existing models in real-time processing and long-term memory management, providing a more efficient and natural multi-modal interaction experience.
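The separation of perception, memory, and reasoning described above can be sketched as two concurrent loops that share a memory object. This is a toy illustration, not the IXC2.5-OL implementation: the names `Memory`, `perception_loop`, `reasoning_loop`, and `run` are hypothetical, the "compression" simply keeps the first and last entry of a short-term window, and retrieval is a crude keyword match standing in for learned components.

```python
import queue
import threading
from dataclasses import dataclass, field


@dataclass
class Memory:
    """Hypothetical memory: a short-term window is compressed into long-term entries."""
    short_term_capacity: int = 4
    short_term: list = field(default_factory=list)
    long_term: list = field(default_factory=list)
    lock: threading.Lock = field(default_factory=threading.Lock)

    def store(self, observation: str) -> None:
        with self.lock:
            self.short_term.append(observation)
            if len(self.short_term) >= self.short_term_capacity:
                # Placeholder compression (assumption): keep first and last entry only.
                summary = f"{self.short_term[0]} ... {self.short_term[-1]}"
                self.long_term.append(summary)
                self.short_term.clear()

    def retrieve(self, query: str) -> list:
        # Placeholder retrieval (assumption): return entries sharing a word with the query.
        with self.lock:
            words = query.lower().split()
            candidates = self.long_term + self.short_term
            return [c for c in candidates if any(w in c.lower() for w in words)]


def perception_loop(stream, memory: Memory, queries: queue.Queue) -> None:
    """Consume the stream continuously; store frames and forward user queries."""
    for kind, payload in stream:
        if kind == "frame":
            memory.store(payload)
        elif kind == "speech":
            queries.put(payload)
    queries.put(None)  # sentinel: the stream has ended


def reasoning_loop(memory: Memory, queries: queue.Queue, answers: list) -> None:
    """Answer queries as they arrive, without blocking perception."""
    while (query := queries.get()) is not None:
        answers.append((query, memory.retrieve(query)))


def run(stream) -> list:
    """Run perception and reasoning in separate threads and collect the answers."""
    memory, queries, answers = Memory(), queue.Queue(), []
    perceiver = threading.Thread(target=perception_loop, args=(stream, memory, queries))
    reasoner = threading.Thread(target=reasoning_loop, args=(memory, queries, answers))
    perceiver.start()
    reasoner.start()
    perceiver.join()
    reasoner.join()
    return answers


if __name__ == "__main__":
    events = [("frame", "cat on sofa"), ("frame", "dog in yard"),
              ("speech", "where is the cat")]
    print(run(events))
```

The point of the design is that the perception thread never waits for an answer to be generated: it keeps writing to memory while the reasoning thread reads from the query queue, which mirrors the "think while perceiving" goal at a very small scale.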

InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions
