Abstract: Embodied artificial intelligence emphasizes the role of an agent's body in generating human-like behaviors. Recent work on embodied AI focuses on building machine learning models that can perceive, plan, and act, thereby enabling real-time interaction with the world. However, most work focuses on bounded indoor environments, such as navigating a room or manipulating a device, and embodied intelligence in open, outdoor scenarios remains less explored. One potential reason is the lack of high-quality simulators, benchmarks, and datasets. To address this gap, we construct a benchmark platform for evaluating embodied intelligence in real-world city environments. Specifically, we first construct a highly realistic 3D simulation environment based on the buildings, roads, and other elements of a real city. In this environment, we combine historically collected data with simulation algorithms to simulate pedestrian and vehicle flows with high fidelity. Further, we design a set of evaluation tasks covering different embodied AI abilities. Moreover, we provide a complete set of input and output interfaces, so that embodied agents can take task requirements and current environmental observations as input, make decisions, and obtain performance evaluations. On the one hand, the platform extends existing embodied intelligence to higher capability levels. On the other hand, it has higher practical value in the real world and can support more potential applications for artificial general intelligence. Based on this platform, we evaluate several popular large language models on embodied intelligence capabilities across different dimensions and difficulty levels.

EmbSpatial-Bench: Benchmarking Spatial Understanding for Embodied Tasks with Large Vision-Language Models

SpatialBot: Precise Spatial Understanding with Vision Language Models

ET-Plan-Bench: Embodied Task-level Planning Benchmark Towards Spatial-Temporal Cognition with Foundation Models

EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI

GSR-BENCH: A Benchmark for Grounded Spatial Reasoning Evaluation via Multimodal LLMs

MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI

VSP: Assessing the dual challenges of perception and reasoning in spatial planning tasks for VLMs

Things not Written in Text: Exploring Spatial Commonsense from Visual Signals

Is A Picture Worth A Thousand Words? Delving Into Spatial Reasoning for Vision Language Models

SpatialRGPT: Grounded Spatial Reasoning in Vision Language Models

An Empirical Analysis on Spatial Reasoning Capabilities of Large Multimodal Models

AutoBench-V: Can Large Vision-Language Models Benchmark Themselves?

SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities

EmbodiedCity: A Benchmark Platform for Embodied Agent in Real-world City Environment

STBench: Assessing the Ability of Large Language Models in Spatio-Temporal Analysis

Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making

VLMbench: A Compositional Benchmark for Vision-and-Language Manipulation

DiffuSyn Bench: Evaluating Vision-Language Models on Real-World Complexities with Diffusion-Generated Synthetic Benchmarks

LSceneLLM: Enhancing Large 3D Scene Understanding Using Adaptive Visual Preferences

GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks

Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning