Abstract: Using vision-language models (VLMs) as reward models in reinforcement learning holds promise for reducing costs and improving safety. So far, VLM reward models have only been used for goal-oriented tasks, where the agent must reach a particular final outcome. We explore VLMs' potential to supervise tasks that cannot be scored by the final state alone. To this end, we introduce ViSTa, a dataset for evaluating Vision-based understanding of Sequential Tasks. ViSTa comprises over 4,000 videos with step-by-step descriptions in virtual home, Minecraft, and real-world environments. Its novel hierarchical structure -- basic single-step tasks composed into more and more complex sequential tasks -- allows a fine-grained understanding of how well VLMs can judge tasks with varying complexity. To illustrate this, we use ViSTa to evaluate state-of-the-art VLMs, including CLIP, ViCLIP, and GPT-4o. We find that, while they are all good at object recognition, they fail to understand sequential tasks, with only GPT-4o achieving non-trivial performance.

What problem does this paper attempt to address?

The main question this paper addresses is: **Can vision-language models (VLMs) understand and evaluate sequential tasks?** VLMs have been used as reward models in reinforcement learning, but so far only for goal-oriented tasks, where the agent must reach a particular final outcome. It is unclear how well these models perform on tasks that cannot be scored by the final state alone, such as sequential tasks involving multiple steps. To investigate this, the authors introduce ViSTa, a dataset for evaluating vision-based understanding of sequential tasks.

### Main research questions:

1. **Can vision-language models effectively supervise and evaluate sequential tasks?**
   - The authors use ViSTa to test VLMs on sequential tasks of varying complexity, and in particular whether the models grasp the sequential nature of those tasks.
2. **How do existing models perform in different environments?**
   - ViSTa covers three environments: virtual home, Minecraft, and the real world. The authors compare model performance across all three.
3. **How do models perform on different types of tasks?**
   - The dataset includes single-action and multi-action tasks. The authors use them to test the models' abilities in object recognition, attribute recognition, and action understanding.

### Research motivation:

Using vision-language models as reward models in reinforcement learning can reduce costs and improve safety. So far, however, these models have mainly been applied to goal-oriented tasks, and it is unclear how well they handle sequential tasks. The authors therefore use ViSTa to systematically evaluate the models on sequential tasks and determine whether they have the potential to become reliable task supervisors.

### Conclusion:

In experiments on ViSTa, the authors find that the evaluated models (CLIP, ViCLIP, and GPT-4o) are all good at object recognition but struggle to understand and evaluate sequential tasks, especially complex multi-step ones, with only GPT-4o achieving non-trivial performance. This indicates that current vision-language models are not yet reliable supervisors for sequential tasks.
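
To make the distinction between goal-oriented and sequential tasks concrete, here is a minimal toy sketch of why a reward based on the final state alone cannot supervise a sequential task. It is not the paper's method or evaluation code. The per-frame text labels, the function names `final_state_reward` and `sequential_reward`, and the door example are all hypothetical; a real VLM reward model would work from video rather than from ready-made labels.

```python
from typing import Sequence


def final_state_reward(frames: Sequence[str], goal: str) -> float:
    """Goal-oriented scoring: looks only at the last observed state."""
    return 1.0 if frames[-1] == goal else 0.0


def sequential_reward(frames: Sequence[str], steps: Sequence[str]) -> float:
    """Sequence-aware scoring: every step must appear in `frames`, in order.

    `step in it` advances the shared iterator, so each step has to be found
    after the previous one (a standard subsequence check).
    """
    it = iter(frames)
    return 1.0 if all(step in it for step in steps) else 0.0


# Hypothetical task: "open the door, then close it".
steps = ["door open", "door closed"]

# Two toy trajectories (one label per frame) that end in the same state.
did_task = ["door closed", "door open", "door closed"]
did_nothing = ["door closed", "door closed", "door closed"]

for name, frames in [("did_task", did_task), ("did_nothing", did_nothing)]:
    print(
        name,
        "final-state:", final_state_reward(frames, "door closed"),
        "sequential:", sequential_reward(frames, steps),
    )
```

Both trajectories end with the door closed, so the final-state check scores them identically, while the sequence-aware check rewards only the trajectory in which the door was actually opened and then closed. A sequential-task supervisor has to make this kind of judgment.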

ViSTa Dataset: Do vision-language models understand sequential tasks?
