ActPlan-1K: Benchmarking the Procedural Planning Ability of Visual Language Models in Household Activities

Ying Su,Zhan Ling,Haochen Shi,Jiayang Cheng,Yauwai Yim,Yangqiu Song

2024-10-05

Abstract: Large language models (LLMs) have been adopted to process textual task descriptions and accomplish procedural planning in embodied AI tasks because of their powerful reasoning ability. However, there is still a lack of study on how vision language models (VLMs) behave when multi-modal task inputs are considered. Counterfactual planning, which evaluates a model's reasoning ability over alternative task situations, is also underexploited. To evaluate planning ability in both the multi-modal and counterfactual aspects, we propose ActPlan-1K. ActPlan-1K is a multi-modal planning benchmark constructed based on ChatGPT and the household activity simulator iGibson2. The benchmark consists of 153 activities and 1,187 instances. Each instance, describing one activity, has a natural language task description and multiple environment images from the simulator. The gold plan of each instance is an action sequence over the objects in the provided scenes. Both correctness and commonsense satisfaction are evaluated on typical VLMs. It turns out that current VLMs still struggle to generate human-level procedural plans for both normal and counterfactual activities. We further provide automatic evaluation metrics by finetuning a BLEURT model to facilitate future research on our benchmark.

Computation and Language

What problem does this paper attempt to address?

The paper addresses how to evaluate the procedural planning capabilities of Visual Language Models (VLMs) when they are given multimodal task inputs in household activities. Specifically:

- Large Language Models (LLMs) have demonstrated strong reasoning abilities in processing textual task descriptions and generating procedural plans, but there is little research on how VLMs perform when the task input is multimodal.
- The paper proposes a new benchmark dataset, ActPlan-1K, constructed with ChatGPT and the household activity simulator iGibson2. It includes 153 activities and 1,187 instances, each comprising a natural language task description and multiple environment images.
- Evaluation of the action-sequence plans generated by VLMs shows that current VLMs still struggle to produce human-level procedural plans, for both normal and counterfactual activities.
- Automatic evaluation metrics are proposed, including a Longest Common Subsequence (LCS) score and a correctness score based on a finetuned BLEURT model, to facilitate future research.

Overall, the paper aims to fill the gap in evaluating the planning capabilities of VLMs in multimodal household activity scenarios and provides a benchmark dataset to advance research in this area.
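To make the LCS-based metric mentioned above more concrete, here is a minimal Python sketch of how a longest-common-subsequence score between a predicted plan and a gold plan could be computed. It is not the paper's implementation: the function names `lcs_length` and `lcs_score`, the exact-string matching of steps, the normalization by gold-plan length, and the example action strings are all assumptions made for illustration.

```python
from typing import Sequence


def lcs_length(pred: Sequence[str], gold: Sequence[str]) -> int:
    """Length of the longest common subsequence of two action sequences."""
    # prev[j] holds the LCS length of the pred prefix processed so far and gold[:j].
    prev = [0] * (len(gold) + 1)
    for p in pred:
        curr = [0]
        for j, g in enumerate(gold, start=1):
            if p == g:
                # Matching steps extend the best subsequence ending before both.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise carry over the better result from skipping one step.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_score(pred: Sequence[str], gold: Sequence[str]) -> float:
    """LCS length divided by gold-plan length (an assumed normalization)."""
    if not gold:
        return 0.0
    return lcs_length(pred, gold) / len(gold)


# Hypothetical plans; the action wording is illustrative, not from the dataset.
gold_plan = ["walk to fridge", "open fridge", "grab milk", "close fridge"]
pred_plan = ["walk to fridge", "grab milk", "close fridge", "walk to table"]

print(lcs_score(pred_plan, gold_plan))  # 0.75
```

In this example the predicted plan shares three ordered steps with the gold plan and skips "open fridge", so the score is 3/4. Because steps are compared as exact strings, paraphrased but equivalent actions would not match, which is one reason a learned metric such as BLEURT can be a useful complement.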

EgoPlan-Bench2: A Benchmark for Multimodal Large Language Model Planning in Real-World Scenarios

EgoPlan-Bench: Benchmarking Multimodal Large Language Models for Human-Level Planning

ViLPAct: A Benchmark for Compositional Generalization on Multimodal Human Activities

On the Planning Abilities of Large Language Models (A Critical Investigation with a Proposed Benchmark)

ING-VP: MLLMs cannot Play Easy Vision-based Games Yet

Exploring and Benchmarking the Planning Capabilities of Large Language Models

NATURAL PLAN: Benchmarking LLMs on Natural Language Planning

Inner Monologue: Embodied Reasoning through Planning with Language Models

PlanBench: An Extensible Benchmark for Evaluating Large Language Models on Planning and Reasoning about Change

On The Planning Abilities of OpenAI's o1 Models: Feasibility, Optimality, and Generalizability

Probing the Multi-turn Planning Capabilities of LLMs via 20 Question Games

Evaluating Vision-Language Models as Evaluators in Path Planning

TravelPlanner: A Benchmark for Real-World Planning with Language Agents

ET-Plan-Bench: Embodied Task-level Planning Benchmark Towards Spatial-Temporal Cognition with Foundation Models

RePLan: Robotic Replanning with Perception and Language Models

Open Grounded Planning: Challenges and Benchmark Construction

ProgPrompt: Generating Situated Robot Task Plans using Large Language Models

LoTa-Bench: Benchmarking Language-oriented Task Planners for Embodied Agents

VSP: Assessing the dual challenges of perception and reasoning in spatial planning tasks for VLMs