LLMs Are In-Context Reinforcement Learners

Giovanni Monea, Antoine Bosselut, Kianté Brantley, Yoav Artzi
2024-10-08
Abstract: Large Language Models (LLMs) can learn new tasks through in-context supervised learning (i.e., ICL). This work studies if this ability extends to in-context reinforcement learning (ICRL), where models are not given gold labels in context, but only their past predictions and rewards. We show that a naive application of ICRL fails miserably, and identify the root cause as a fundamental deficiency at exploration, which leads to quick model degeneration. We propose an algorithm to address this deficiency by increasing test-time compute, as well as a compute-bound approximation. We use several challenging classification tasks to empirically show that our ICRL algorithms lead to effective learning from rewards alone, and analyze the characteristics of this ability and our methods. Overall, our results reveal remarkable ICRL abilities in LLMs.
Computation and Language, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
This paper asks whether large language models (LLMs) can learn new tasks through in-context reinforcement learning (ICRL). Specifically, the researchers explore whether LLMs can learn effectively from only their past predictions and the rewards those predictions received, without gold labels. The paper reports that applying ICRL naively leads to rapid model degeneration: the model is fundamentally poor at exploration and quickly collapses to predicting the same output. To address this, the authors propose two algorithms, one that increases test-time computation and a compute-bound approximation of it, to improve the model's in-context exploration and learning.

### Main Contributions

1. **Problem Identification**: Shows that applying ICRL naively leads to model degeneration, mainly because of insufficient exploration.
2. **Proposed Solutions**:
   - **Exploratory ICRL**: Improves exploration by introducing randomness into the context construction process.
   - **Approximate ICRL**: Reduces computational requirements while maintaining effective learning performance.
3. **Experimental Verification**: Experiments on multiple classification tasks verify that the proposed algorithms improve ICRL performance.

### Experimental Results

- **Exploratory ICRL**: Significantly outperforms zero-shot and naive ICRL on all tasks and models, and its performance continues to improve as the amount of data increases.
- **Approximate ICRL**: On the Llama model, its performance is close to that of Exploratory ICRL; on the Phi model, it needs more computational resources to achieve a similar effect.
- **Computational Efficiency**: Approximate ICRL processes about two orders of magnitude fewer tokens than Exploratory ICRL.
### Conclusion

The paper shows that introducing randomness into the context and filtering out negative reward signals effectively addresses the insufficient exploration in ICRL, enabling LLMs to learn from reward signals alone. These methods not only improve the model's performance but also provide new directions for future research.
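
The two ingredients named in the conclusion, randomness in context construction and filtering of negative rewards, can be illustrated with a small sketch. This is not the authors' implementation. The `Episode` record, the `build_context` and `format_prompt` helpers, the keep probability `p_keep`, and the prompt layout are all hypothetical names and choices introduced here for illustration, and the call to the language model itself is left out.

```python
import random
from dataclasses import dataclass


@dataclass
class Episode:
    text: str        # input shown to the model
    prediction: str  # label the model predicted for that input
    reward: int      # 1 if the prediction was rewarded, 0 otherwise


def build_context(history, p_keep, rng):
    """Sample a stochastic context from past episodes (illustrative only).

    Episodes with a non-positive reward are dropped; each remaining episode
    is then kept independently with probability p_keep (an assumed knob).
    """
    positives = [ep for ep in history if ep.reward > 0]
    return [ep for ep in positives if rng.random() < p_keep]


def format_prompt(context, new_text):
    """Render the sampled episodes and the new input as a plain-text prompt."""
    lines = []
    for ep in context:
        lines.append(f"Input: {ep.text}")
        lines.append(f"Prediction: {ep.prediction}")
        lines.append("Reward: positive")
    lines.append(f"Input: {new_text}")
    lines.append("Prediction:")
    return "\n".join(lines)


# Toy history: two rewarded predictions and one unrewarded prediction.
history = [
    Episode("I lost my card", "card_lost", 1),
    Episode("What is my balance?", "card_lost", 0),
    Episode("Transfer failed", "transfer_issue", 1),
]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
context = build_context(history, p_keep=0.5, rng=rng)
prompt = format_prompt(context, "My card was stolen")
```

Because a fresh context is sampled for every new input, the model sees a different subset of its rewarded predictions each time, which is one plausible way to keep it from locking onto a single output. The price is that the prompt changes on every step and has to be processed again, which fits the summary's point that the exploratory variant spends more test-time compute.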