ViSTa Dataset: Do vision-language models understand sequential tasks?

Evžen Wybitul, Evan Ryan Gunter, Mikhail Seleznyov, David Lindner
2024-11-22
Abstract: Using vision-language models (VLMs) as reward models in reinforcement learning holds promise for reducing costs and improving safety. So far, VLM reward models have only been used for goal-oriented tasks, where the agent must reach a particular final outcome. We explore VLMs' potential to supervise tasks that cannot be scored by the final state alone. To this end, we introduce ViSTa, a dataset for evaluating Vision-based understanding of Sequential Tasks. ViSTa comprises over 4,000 videos with step-by-step descriptions in virtual home, Minecraft, and real-world environments. Its novel hierarchical structure -- basic single-step tasks composed into more and more complex sequential tasks -- allows a fine-grained understanding of how well VLMs can judge tasks with varying complexity. To illustrate this, we use ViSTa to evaluate state-of-the-art VLMs, including CLIP, ViCLIP, and GPT-4o. We find that, while they are all good at object recognition, they fail to understand sequential tasks, with only GPT-4o achieving non-trivial performance.
Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
The main question this paper addresses is: **Can vision-language models (VLMs) understand and evaluate sequential tasks?** VLMs have already been used as reward models in reinforcement learning, but only for goal-oriented tasks, in which the agent must reach a particular final outcome. It is unclear how these models perform on tasks that cannot be scored by the final state alone, such as sequential tasks involving multiple steps. To investigate this, the authors introduce ViSTa, a dataset for evaluating vision-based understanding of sequential tasks.

### Main research questions:

1. **Can vision-language models effectively supervise and evaluate sequential tasks?**
   - The authors use ViSTa to test VLMs on sequential tasks of varying complexity, focusing on whether the models grasp the sequential nature of the tasks.
2. **How do existing models perform in different environments?**
   - ViSTa covers three environments: virtual home, Minecraft, and the real world. The authors compare model performance across all three.
3. **How do models perform on different types of tasks?**
   - The dataset includes single-action and multi-action tasks. The authors use them to test the models' abilities in object recognition, attribute recognition, and action understanding.

### Research motivation:

Using vision-language models as reward models in reinforcement learning can reduce costs and improve safety. So far, however, these models have been applied mainly to goal-oriented tasks, and their ability to supervise sequential tasks has not been established. The authors therefore use ViSTa to evaluate the models systematically on sequential tasks and to determine whether they could become reliable task supervisors.

### Conclusion:

Experiments on ViSTa show that, although existing vision-language models perform well at object recognition, they struggle to understand and evaluate sequential tasks, especially complex multi-step ones. Of the models evaluated (CLIP, ViCLIP, and GPT-4o), only GPT-4o achieves non-trivial performance. Current vision-language models are therefore not yet reliable supervisors for sequential tasks.
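
To make concrete why a sequential task cannot be scored by its final state alone, the following is a minimal toy sketch. It does not use ViSTa or any VLM; the step labels and the functions `final_state_score` and `sequence_score` are hypothetical, chosen only for illustration. Two trajectories end in the same step, so a final-state scorer cannot tell them apart, while an order-aware scorer can.

```python
from typing import List


def final_state_score(steps: List[str], goal_step: str) -> float:
    """Score a trajectory by looking only at its last step (toy stand-in
    for a goal-oriented reward that inspects the final state)."""
    return 1.0 if steps and steps[-1] == goal_step else 0.0


def sequence_score(steps: List[str], required_order: List[str]) -> float:
    """Score 1.0 only if the required steps appear in the trajectory
    in the given order (as an ordered subsequence)."""
    remaining = iter(steps)
    # `step in remaining` advances the iterator, so order is enforced.
    return 1.0 if all(step in remaining for step in required_order) else 0.0


# Hypothetical three-step task and two trajectories with the same last step.
task = ["wash hands", "chop vegetables", "serve plate"]
correct = ["wash hands", "chop vegetables", "serve plate"]
swapped = ["chop vegetables", "wash hands", "serve plate"]

for name, trajectory in [("correct", correct), ("swapped", swapped)]:
    print(
        name,
        final_state_score(trajectory, "serve plate"),
        sequence_score(trajectory, task),
    )
```

Both trajectories receive the same final-state score, but only the correctly ordered one receives a full sequence score. A reward model supervising such a task has to judge the whole video, which is the capability the dataset is described as probing.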