Abstract: Large vision-language models (VLMs) fine-tuned on specialized visual instruction-following data have exhibited impressive language reasoning capabilities across various scenarios. However, this fine-tuning paradigm may not be able to efficiently learn optimal decision-making agents in multi-step goal-directed tasks from interactive environments. To address this challenge, we propose an algorithmic framework that fine-tunes VLMs with reinforcement learning (RL). Specifically, our framework provides a task description and then prompts the VLM to generate chain-of-thought (CoT) reasoning, enabling the VLM to efficiently explore intermediate reasoning steps that lead to the final text-based action. Next, the open-ended text output is parsed into an executable action to interact with the environment to obtain goal-directed task rewards. Finally, our framework uses these task rewards to fine-tune the entire VLM with RL. Empirically, we demonstrate that our proposed framework enhances the decision-making capabilities of VLM agents across various tasks, enabling 7B models to outperform commercial models such as GPT4-V or Gemini. Furthermore, we find that CoT reasoning is a crucial component for performance improvement, as removing the CoT reasoning results in a significant decrease in the overall performance of our method.

What problem does this paper attempt to address?

The paper attempts to address the problem of enhancing the decision-making capabilities of large-scale Vision-Language Models (VLMs) in multi-step goal-oriented tasks through Reinforcement Learning (RL). Specifically, while existing VLMs exhibit strong language reasoning abilities in various scenarios, these models are primarily fine-tuned through supervised learning and lack the ability to interact with the environment. As a result, they perform poorly in multi-step interactive environments that require both visual recognition and language understanding. To overcome this challenge, the paper proposes an algorithmic framework that directly fine-tunes VLMs using reinforcement learning, enabling them to make better decisions in multi-step goal-oriented tasks.

### Main Contributions:

1. **Algorithmic Framework**: A new algorithmic framework is proposed that can directly fine-tune VLMs using reinforcement learning, enhancing their decision-making capabilities in multi-step goal-oriented tasks.
2. **Chain-of-Thought (CoT) Reasoning**: The introduction of the CoT reasoning mechanism allows VLMs to generate intermediate reasoning steps, thereby more efficiently exploring the final textual actions.
3. **Experimental Validation**: The effectiveness of the method is validated through experiments on multiple tasks, particularly in visual semantic understanding and fine-grained visual recognition tasks, where a 7B parameter model outperforms commercial models such as GPT4-V and Gemini.

### Method Overview:

- **Input and Output**: At each time step, the VLM receives the current observation and preset prompts as input and outputs expressions containing chain-of-thought reasoning and textual actions.
- **Environment Interaction**: The textual actions are parsed into executable actions that interact with the environment to obtain task rewards.
- **Reinforcement Learning Fine-Tuning**: Task rewards are used to fine-tune the entire VLM through reinforcement learning, improving its decision-making capabilities.

### Experimental Results:

- **Task Performance**: The method significantly improves the decision-making capabilities of VLMs in multiple tasks within the gym_cards and alfworld domains.
- **Importance of CoT Reasoning**: Experiments demonstrate that CoT reasoning is a key factor in performance improvement, and removing CoT reasoning leads to a significant drop in overall performance.

In summary, the paper effectively enhances the decision-making capabilities of VLMs in multi-step goal-oriented tasks by introducing reinforcement learning and chain-of-thought reasoning mechanisms.
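
The parse-then-interact loop from the method overview can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration and is not taken from the paper: the `action: "..."` output convention, the two-action set, the random fallback for unparseable output, and the `ToyEnv` stub. The RL update itself is omitted; the sketch only shows how open-ended text might be turned into an executable action and how task rewards could be collected for a later update.

```python
import random
import re

# Hypothetical action set for a toy card task (not from the paper).
LEGAL_ACTIONS = ["hit", "stand"]

# Assumed convention: the model ends its reasoning with a line like
# 'action: "stand"'. The real prompt and output format may differ.
ACTION_PATTERN = re.compile(
    r'action\s*:\s*"?([a-z_ ]+?)"?\s*$', re.IGNORECASE | re.MULTILINE
)


def parse_action(vlm_output, legal_actions=LEGAL_ACTIONS, rng=random):
    """Map open-ended text to an executable action.

    Falls back to a random legal action when no legal action is found
    (an assumption made here for illustration).
    """
    matches = ACTION_PATTERN.findall(vlm_output)
    if matches:
        candidate = matches[-1].strip().lower()  # use the last stated action
        if candidate in legal_actions:
            return candidate
    return rng.choice(legal_actions)


class ToyEnv:
    """Hypothetical one-step environment: reward 1.0 for 'stand', else 0.0."""

    def reset(self):
        return "observation-0"

    def step(self, action):
        reward = 1.0 if action == "stand" else 0.0
        return "observation-1", reward, True  # (next_obs, reward, done)


def rollout(policy, env, max_steps=10):
    """Collect (observation, raw_text, action, reward) tuples.

    `policy` stands in for the VLM: it maps an observation to text that
    contains chain-of-thought reasoning followed by a textual action.
    """
    obs = env.reset()
    trajectory = []
    for _ in range(max_steps):
        text = policy(obs)
        action = parse_action(text)
        next_obs, reward, done = env.step(action)
        trajectory.append((obs, text, action, reward))
        obs = next_obs
        if done:
            break
    return trajectory
```

In a full pipeline, the rewards stored in each trajectory would feed a policy-gradient style update of the model's parameters; that step depends on the training setup and is left out of this sketch.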

Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning

Empowering Vision-Language Models for Reasoning Ability Through Large Language Models

On the Modeling Capabilities of Large Language Models for Sequential Decision Making

Enhancing Advanced Visual Reasoning Ability of Large Language Models

From Goal-Conditioned to Language-Conditioned Agents via Vision-Language Models

Improve Vision Language Model Chain-of-thought Reasoning

Vision-Language Models as a Source of Rewards

RL-VLM-F: Reinforcement Learning from Vision Language Foundation Model Feedback

Vision-Language Models Can Self-Improve Reasoning Via Reflection

Integrating Large Language Models and Reinforcement Learning for Non-Linear Reasoning

Vision-Language Models are Zero-Shot Reward Models for Reinforcement Learning

Teaching Large Language Models to Reason with Reinforcement Learning

Large Language Models are Visual Reasoning Coordinators

VLRewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models

Enhance Reasoning Ability of Visual-Language Models via Large Language Models

How to Build a Pre-trained Multimodal model for Simultaneously Chatting and Decision-making?

CogCoM: Train Large Vision-Language Models Diving into Details through Chain of Manipulations

Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models

Improving Large Language Models via Fine-grained Reinforcement Learning with Minimum Editing Constraint

ViGoR: Improving Visual Grounding of Large Vision Language Models with Fine-Grained Reward Modeling

MC-GPT: Empowering Vision-and-Language Navigation with Memory Map and Reasoning Chains