Abstract: While Vision-Language Models (VLMs) hold promise for tasks requiring extensive collaboration, traditional multi-agent simulators have facilitated rich explorations of an interactive artificial society that reflects collective behavior. However, these existing simulators face significant limitations. First, they struggle to handle large numbers of agents because of high resource demands. Second, they often assume agents possess perfect information and limitless capabilities, which undermines the ecological validity of simulated social interactions. To bridge this gap, we propose a multi-agent Minecraft simulator, MineLand, which introduces three key features: large-scale scalability, limited multimodal senses, and physical needs. Our simulator supports 64 or more agents. Agents have limited visual, auditory, and environmental awareness, forcing them to actively communicate and collaborate to fulfill physical needs like food and resources. We also introduce an AI agent framework, Alex, inspired by multitasking theory, enabling agents to handle intricate coordination and scheduling. Our experiments demonstrate that the simulator, the corresponding benchmark, and the AI agent framework contribute to more ecological and nuanced collective behavior. The source code of MineLand and Alex is openly available at

What problem does this paper attempt to address?

This paper addresses two limitations of existing multi-agent open-world simulators: they struggle to handle large numbers of agents, and they assume that agents have perfect information and unlimited capabilities. These issues limit the ecological validity of simulated social interactions; that is, interactions in the simulated environment differ significantly from those of humans in the real world. To bridge this gap, the paper proposes a multi-agent Minecraft simulator named MineLand, which introduces three key features: large-scale scalability, limited multimodal perception, and physiological needs. These features aim to let the simulator support more agents while reflecting real-world social interactions more realistically. Specifically, MineLand addresses the above problems in the following ways:

1. **Large-scale Scalability**: By reducing the performance overhead of each Minecraft client, MineLand can support 64 or more agents on mainstream consumer-level desktop computers, while traditional simulators can usually only support 2 agents.
2. **Limited Multimodal Perception Capabilities**: Agents operate in a partially observable environment with an egocentric perspective and limited visual and auditory perception. This design mimics how distance, terrain, and environment affect visibility and hearing in real life. It restricts information acquisition and forces agents to actively communicate to compensate for what they cannot sense directly.
3. **Physiological Needs**: Agents must meet basic physiological needs, such as food, survival, and resource management, which adds daily routines along the time dimension. This setting requires cooperation and competition among agents to obtain resources, reflecting the complex interaction between cooperation and self-interest in human society.

Through these improvements, MineLand not only improves the ecological validity of multi-agent simulation but also provides a rich platform for evaluating the multi-agent capabilities of large language models (LLMs) and vision-language models (VLMs). The paper also proposes an AI agent framework named Alex, which is inspired by multitasking theory from the cognitive field and can handle complex coordination and scheduling tasks simultaneously, further enhancing the capabilities of agents.
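
To make the limited-perception and physiological-needs ideas above concrete, here is a minimal toy sketch in Python. It is not MineLand's API: the `ToyAgent` class, the 16-block hearing radius, the 0-20 food scale, and the hunger threshold are all hypothetical values chosen for illustration. It only shows how a distance cutoff restricts who receives a message and how a depleting food level creates a recurring need.

```python
import math
from dataclasses import dataclass


@dataclass
class ToyAgent:
    """Hypothetical agent; not part of MineLand or Alex."""
    name: str
    x: float
    z: float
    food: int = 20  # assumed 0-20 scale, 20 = fully fed

    def tick(self, cost: int = 1) -> None:
        # Food drains over time and never drops below zero.
        self.food = max(0, self.food - cost)

    def needs_food(self, threshold: int = 6) -> bool:
        # Assumed threshold below which the agent should seek food.
        return self.food <= threshold


def can_hear(listener: ToyAgent, speaker: ToyAgent, radius: float = 16.0) -> bool:
    # Hearing is limited by horizontal distance (assumed radius in blocks).
    distance = math.hypot(listener.x - speaker.x, listener.z - speaker.z)
    return distance <= radius


def broadcast(speaker: ToyAgent, agents: list[ToyAgent], radius: float = 16.0) -> list[str]:
    # Only agents within earshot receive the speaker's message.
    return [a.name for a in agents if a is not speaker and can_hear(a, speaker, radius)]


agents = [ToyAgent("a", 0, 0), ToyAgent("b", 10, 0), ToyAgent("c", 100, 0)]
heard_by = broadcast(agents[0], agents)

# Simulate 15 time steps of food drain for agent "a".
for _ in range(15):
    agents[0].tick()
hungry = agents[0].needs_food()
```

In this toy setup, agent `c` is too far away to hear agent `a`, so information about food or resources would have to be relayed through agent `b`, which is the kind of active communication that limited senses are meant to force.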

MineLand: Simulating Large-Scale Multi-Agent Interactions with Limited Multimodal Senses and Physical Needs
