Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning

Yuexiang Zhai, Hao Bai, Zipeng Lin, Jiayi Pan, Shengbang Tong, Yifei Zhou, Alane Suhr, Saining Xie, Yann LeCun, Yi Ma, Sergey Levine
2024-10-08
Abstract: Large vision-language models (VLMs) fine-tuned on specialized visual instruction-following data have exhibited impressive language reasoning capabilities across various scenarios. However, this fine-tuning paradigm may not be able to efficiently learn optimal decision-making agents in multi-step goal-directed tasks from interactive environments. To address this challenge, we propose an algorithmic framework that fine-tunes VLMs with reinforcement learning (RL). Specifically, our framework provides a task description and then prompts the VLM to generate chain-of-thought (CoT) reasoning, enabling the VLM to efficiently explore intermediate reasoning steps that lead to the final text-based action. Next, the open-ended text output is parsed into an executable action to interact with the environment to obtain goal-directed task rewards. Finally, our framework uses these task rewards to fine-tune the entire VLM with RL. Empirically, we demonstrate that our proposed framework enhances the decision-making capabilities of VLM agents across various tasks, enabling 7b models to outperform commercial models such as GPT4-V or Gemini. Furthermore, we find that CoT reasoning is a crucial component for performance improvement, as removing the CoT reasoning results in a significant decrease in the overall performance of our method.
Artificial Intelligence, Computation and Language, Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
The paper addresses how to improve the decision-making capabilities of large vision-language models (VLMs) in multi-step, goal-directed tasks through reinforcement learning (RL). Although existing VLMs exhibit strong language reasoning abilities in various scenarios, they are primarily fine-tuned through supervised learning and are not trained by interacting with an environment. As a result, they perform poorly in multi-step interactive environments that require both visual recognition and language understanding. To overcome this challenge, the paper proposes an algorithmic framework that fine-tunes VLMs directly with reinforcement learning, enabling them to make better decisions in such tasks.

### Main Contributions:

1. **Algorithmic Framework**: A new algorithmic framework that fine-tunes VLMs directly with reinforcement learning, enhancing their decision-making capabilities in multi-step, goal-directed tasks.
2. **Chain-of-Thought (CoT) Reasoning**: CoT prompting lets the VLM generate intermediate reasoning steps, so that it explores more efficiently on the way to the final text action.
3. **Experimental Validation**: The method is validated through experiments on multiple tasks, particularly tasks that require visual semantic understanding and fine-grained visual recognition, where a 7B-parameter model outperforms commercial models such as GPT4-V and Gemini.

### Method Overview:

- **Input and Output**: At each time step, the VLM receives the current observation and a preset prompt as input and outputs text containing chain-of-thought reasoning followed by a text action.
- **Environment Interaction**: The text action is parsed into an executable action that interacts with the environment to obtain a task reward.
- **Reinforcement Learning Fine-Tuning**: Task rewards are used to fine-tune the entire VLM through reinforcement learning, improving its decision-making capabilities.

### Experimental Results:

- **Task Performance**: The method significantly improves the decision-making capabilities of VLMs in multiple tasks within the gym_cards and alfworld domains.
- **Importance of CoT Reasoning**: Experiments demonstrate that CoT reasoning is a key factor in performance improvement, and removing CoT reasoning leads to a significant drop in overall performance.

In summary, the paper effectively enhances the decision-making capabilities of VLMs in multi-step goal-oriented tasks by introducing reinforcement learning and chain-of-thought reasoning mechanisms.
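
### Illustrative Sketch:

The interaction loop in the method overview (generate CoT text, parse it into an action, step the environment, record the reward) can be illustrated with a small Python sketch. This is not the authors' code. The output format (a `"thoughts"` field followed by an `"action"` field), the random fallback for unparseable output, the action names, and the helpers `parse_action`, `collect_episode`, `toy_policy`, and `toy_env_step` are all assumptions made for illustration. Observations are plain strings here rather than images, and the RL update that would consume the collected rewards is omitted.

```python
import random
import re
from typing import Callable, List, Tuple

# Hypothetical action set for illustration; not taken from the paper.
LEGAL_ACTIONS = ["hit", "stand"]

# (observation, raw model text, parsed action, reward)
Transition = Tuple[str, str, str, float]


def parse_action(text: str, legal_actions: List[str], rng: random.Random) -> str:
    """Map open-ended model output to an executable action.

    Assumes the model ends its reasoning with a field of the form
    "action": "<name>". If no legal action is found, this sketch falls
    back to a uniformly random legal action (an assumption).
    """
    match = re.search(r'"action"\s*:\s*"([^"]*)"', text)
    if match:
        candidate = match.group(1).strip().lower()
        if candidate in legal_actions:
            return candidate
    return rng.choice(legal_actions)


def collect_episode(
    policy: Callable[[str], str],
    env_step: Callable[[str], Tuple[str, float, bool]],
    first_obs: str,
    legal_actions: List[str],
    rng: random.Random,
    max_steps: int = 10,
) -> List[Transition]:
    """Roll out one episode and record the data an RL update would need."""
    obs = first_obs
    episode: List[Transition] = []
    for _ in range(max_steps):
        text = policy(obs)                               # CoT reasoning + text action
        action = parse_action(text, legal_actions, rng)  # text -> executable action
        next_obs, reward, done = env_step(action)        # environment feedback
        episode.append((obs, text, action, reward))
        obs = next_obs
        if done:
            break
    return episode


def toy_policy(obs: str) -> str:
    # Stand-in for the VLM: always reasons briefly, then chooses "stand".
    return '{"thoughts": "My total is 19, so drawing is risky.", "action": "stand"}'


def toy_env_step(action: str) -> Tuple[str, float, bool]:
    # Stand-in environment: a single step that rewards "stand".
    return ("terminal", 1.0 if action == "stand" else -1.0, True)


episode = collect_episode(
    toy_policy, toy_env_step, "total=19", LEGAL_ACTIONS, random.Random(0)
)
```

In a real setup, `toy_policy` would be replaced by a call to the VLM with the current image and prompt, and the recorded `(observation, text, action, reward)` tuples would be passed to whatever RL algorithm is used to update the model.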