Abstract: Reinforcement learning (RL) is a promising approach for aligning the knowledge of large language models (LLMs) with sequential decision-making tasks. However, few studies have thoroughly investigated how fine-tuning LLM agents with RL in a specific environment affects their capabilities. In this paper, we propose a novel framework to analyze the sensitivity of LLMs to prompt formulations following RL training in a textual environment. Our findings reveal that the performance of LLMs degrades when they face prompt formulations different from those used during the RL training phase. In addition, we analyze the source of this sensitivity by examining the model's internal representations and salient tokens. Finally, we propose a contrastive loss to mitigate this sensitivity and improve the robustness and generalization capabilities of LLMs.

What problem does this paper attempt to address?

The problem that this paper attempts to solve is the sensitivity and overfitting of large language models (LLMs) to different prompts after fine-tuning through reinforcement learning (RL). Specifically, the research found that LLMs trained by RL in a specific environment experience a significant performance decline when faced with prompt forms different from those seen during training. This phenomenon is called "prompt overfitting". The main objective of the paper is to analyze the causes of this overfitting and propose solutions to improve the generalization ability and robustness of LLMs.

### Main research questions:

1. **Prompt sensitivity**: How sensitive are LLMs to different prompt forms? How does this sensitivity affect their generalization ability across prompt forms?
2. **State representation**: How do LLMs encode the state space in their hidden representations? What is the topological structure of these representations?
3. **The influence of prompt information on action selection**: After fine-tuning with multiple prompts, which parts of the prompts do LLMs focus on when completing tasks?

### Solutions:

- **Contrastive learning loss**: To alleviate prompt overfitting, the paper proposes a contrastive learning loss that aims to make the hidden representations of LLMs invariant to different prompt forms. In this way, zero-shot performance and robustness to prompt changes can be improved, and at the same time the model's ability to acquire new knowledge in the environment can be enhanced.

### Experimental setup:

- **Environment**: The experiments were carried out in two text environments: BabyAI-Text and TWC-Medium.
- **Prompt design**: Four different prompt forms (P0, P1, P2, P3) were defined, each providing different combinations of goals, possible actions, inventories, and text observations.
- **Training and evaluation**: Multiple LLMs (such as Flan-T5, GPT-Neo, etc.) were used for training, and the performance under different prompt forms was evaluated during training and testing respectively.

### Main findings:

- **Prompt sensitivity**: LLMs without fine-tuning perform poorly in zero-shot scenarios, while LLMs fine-tuned with a single prompt form perform well under that prompt form but show a significant performance decline under other prompt forms.
- **State representation**: LLMs tend to cluster prompts according to prompt form rather than content, which further confirms the prompt overfitting phenomenon.
- **The importance of prompt information**: LLMs focus on different parts of the prompts under different prompt forms, which is related to the performance changes.

### Conclusion:

Through detailed experiments and analysis, the paper reveals the sensitivity and overfitting of LLMs to prompt forms after RL fine-tuning, and proposes a contrastive learning loss to alleviate this problem. This not only improves the generalization ability of LLMs under different prompt forms but also enhances the robustness and adaptability of the model in the interactive environment.
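
To make the contrastive idea from the Solutions section concrete, here is a minimal sketch of a loss that rewards hidden representations for grouping by state rather than by prompt form. It is an assumption-laden illustration, not the paper's exact objective: it uses an InfoNCE-style formulation with cosine similarity, and the names `prompt_contrastive_loss`, `reps`, and `temperature` are hypothetical. Representations are plain lists of floats standing in for an LLM's hidden states, and all vectors are assumed to be non-zero.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over an iterable of floats."""
    values = list(values)
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def prompt_contrastive_loss(reps, temperature=0.1):
    """InfoNCE-style loss over (state_id, prompt_id, vector) triples.

    Positives for an anchor are representations of the SAME state under
    other prompt forms; every other representation acts as a negative.
    Hypothetical formulation, assumed for illustration only.
    """
    terms = []
    for i, (state_i, _, z_i) in enumerate(reps):
        # Scaled similarity of the anchor to every other representation.
        logits = {
            j: cosine(z_i, z_j) / temperature
            for j, (_, _, z_j) in enumerate(reps)
            if j != i
        }
        positives = [j for j in logits if reps[j][0] == state_i]
        if not positives:
            continue  # no other prompt form available for this state
        log_denominator = logsumexp(logits.values())
        # One term per positive pair: -log softmax probability of the positive.
        terms.extend(log_denominator - logits[j] for j in positives)
    return sum(terms) / len(terms) if terms else 0.0


# Toy data: two states, each rendered with prompt forms P0 and P1.
clustered_by_state = [
    ("s0", "P0", [1.0, 0.0]), ("s0", "P1", [0.9, 0.1]),
    ("s1", "P0", [0.0, 1.0]), ("s1", "P1", [0.1, 0.9]),
]
clustered_by_prompt = [
    ("s0", "P0", [1.0, 0.0]), ("s0", "P1", [0.0, 1.0]),
    ("s1", "P0", [0.9, 0.1]), ("s1", "P1", [0.1, 0.9]),
]

if __name__ == "__main__":
    print("loss, grouped by state :", prompt_contrastive_loss(clustered_by_state))
    print("loss, grouped by prompt:", prompt_contrastive_loss(clustered_by_prompt))
```

In this toy setup the loss is lower when representations of the same state sit close together regardless of prompt form, which is the invariance the summary describes. In an actual training loop such a term would presumably be added to the RL objective with some weighting, but that weighting and the layer from which representations are taken are not specified here.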

Reinforcement Learning for Aligning Large Language Models Agents with Interactive Environments: Quantifying and Mitigating Prompt Overfitting
