ActPlan-1K: Benchmarking the Procedural Planning Ability of Visual Language Models in Household Activities

Ying Su, Zhan Ling, Haochen Shi, Jiayang Cheng, Yauwai Yim, Yangqiu Song
2024-10-05
Abstract: Large language models (LLMs) have been adopted to process textual task descriptions and accomplish procedural planning in embodied AI tasks because of their powerful reasoning ability. However, there is still a lack of study on how vision language models (VLMs) behave when multi-modal task inputs are considered. Counterfactual planning, which evaluates a model's reasoning ability over alternative task situations, is also under-exploited. To evaluate planning ability in both the multi-modal and counterfactual aspects, we propose ActPlan-1K. ActPlan-1K is a multi-modal planning benchmark constructed based on ChatGPT and the household activity simulator iGibson2. The benchmark consists of 153 activities and 1,187 instances. Each instance describes one activity and has a natural language task description and multiple environment images from the simulator. The gold plan of each instance is an action sequence over the objects in the provided scenes. Both correctness and commonsense satisfaction are evaluated on typical VLMs. It turns out that current VLMs still struggle to generate human-level procedural plans for both normal activities and counterfactual activities. We further provide automatic evaluation metrics by finetuning the BLEURT model to facilitate future research on our benchmark.
Computation and Language
What problem does this paper attempt to address?
The paper addresses how to evaluate the procedural planning capabilities of Visual Language Models (VLMs) when they are given multimodal task inputs for household activities. Specifically:
- Large Language Models (LLMs) have demonstrated strong reasoning abilities in processing textual task descriptions and generating procedural plans, but there is insufficient research on how VLMs perform under multimodal task inputs.
- The paper proposes a new benchmark dataset, ActPlan-1K, constructed based on ChatGPT and the household activity simulator iGibson2. It includes 153 activities and 1,187 instances, with each instance comprising a natural language task description and multiple environment images.
- Evaluating the action sequence plans generated by VLMs shows that current VLMs still struggle to generate human-level procedural plans, for both normal and counterfactual activities.
- Automatic evaluation metrics are proposed, including a Longest Common Subsequence (LCS) score and a scoring method based on the BLEURT model, to facilitate future research.
Overall, the paper aims to fill the gap in evaluating the planning capabilities of VLMs in multimodal household activity scenarios and provides a benchmark dataset to advance research in this area.
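The LCS score mentioned above can be illustrated with a minimal sketch. It treats a plan as a list of step strings and counts the longest run of steps that appear in the same order in both the generated plan and the gold plan. The function names, the exact-match comparison of steps, the normalization by gold plan length, and the example steps are all assumptions made for illustration; the paper's own definition may differ.

```python
from typing import List


def lcs_length(pred: List[str], gold: List[str]) -> int:
    """Length of the longest common subsequence of two step lists."""
    m, n = len(pred), len(gold)
    # dp[i][j] holds the LCS length of pred[:i] and gold[:j].
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if pred[i - 1] == gold[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]


def lcs_score(pred: List[str], gold: List[str]) -> float:
    """LCS length divided by gold plan length (an assumed normalization)."""
    if not gold:
        return 0.0
    return lcs_length(pred, gold) / len(gold)


# Hypothetical plans, not taken from the benchmark.
gold_plan = [
    "walk to fridge",
    "open fridge",
    "grab milk",
    "walk to table",
    "place milk on table",
]
pred_plan = [
    "walk to fridge",
    "open fridge",
    "grab milk",
    "close fridge",
]

score = lcs_score(pred_plan, gold_plan)
print(f"LCS score: {score:.2f}")
```

Exact string matching is the simplest choice here; a real evaluation would likely need to normalize step wording before comparison, since two plans can express the same action differently.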