Abstract: Large Language Models (LLMs) can learn new tasks through in-context supervised learning (i.e., ICL). This work studies if this ability extends to in-context reinforcement learning (ICRL), where models are not given gold labels in context, but only their past predictions and rewards. We show that a naive application of ICRL fails miserably, and identify the root cause as a fundamental deficiency at exploration, which leads to quick model degeneration. We propose an algorithm to address this deficiency by increasing test-time compute, as well as a compute-bound approximation. We use several challenging classification tasks to empirically show that our ICRL algorithms lead to effective learning from rewards alone, and analyze the characteristics of this ability and our methods. Overall, our results reveal remarkable ICRL abilities in LLMs.
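
The difference between the two settings can be made concrete with a minimal sketch of how each kind of context might be laid out as text. The prompt template, the field names, and the `Episode` class are illustrative assumptions, not the format used in the paper.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    """One past interaction (hypothetical container, not from the paper)."""
    text: str        # the input the model saw
    prediction: str  # the label the model itself predicted
    reward: int      # feedback received for that prediction, e.g. 1 or 0


def icl_context(examples: list[tuple[str, str]]) -> str:
    """Supervised ICL: every demonstration carries its gold label."""
    return "\n\n".join(f"Input: {x}\nLabel: {y}" for x, y in examples)


def icrl_context(episodes: list[Episode]) -> str:
    """ICRL: only the model's own prediction and its reward are shown."""
    return "\n\n".join(
        f"Input: {e.text}\nPrediction: {e.prediction}\nReward: {e.reward}"
        for e in episodes
    )
```

In the ICRL layout a gold label never appears; a wrong prediction is only marked by its reward, so the model has to infer the correct label by trying alternatives.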

What problem does this paper attempt to address?

The problem this paper attempts to solve is whether large language models (LLMs) can learn new tasks through in-context reinforcement learning (ICRL). Specifically, the researchers explore whether LLMs can learn effectively from only their past predictions and rewards, in the absence of labeled data. The paper reports that applying ICRL directly leads to rapid model degeneration, mainly because the model has a fundamental deficiency in exploration, which causes it to quickly collapse to predicting the same output. To address this, the authors propose an algorithm that increases the amount of computation at test time, together with a compute-bound approximation of it, to improve the model's in-context exploration and learning.

### Main Contributions

1. **Problem Identification**: Shows that applying ICRL directly leads to model degeneration, mainly due to insufficient exploration.
2. **Proposed Solutions**:
   - **Exploratory ICRL**: Increases the model's exploration by introducing randomness into the context construction process.
   - **Approximate ICRL**: Reduces computational requirements while maintaining effective learning performance.
3. **Experimental Verification**: Experiments on multiple classification tasks verify that the proposed algorithms improve ICRL performance.

### Experimental Results

- **Exploratory ICRL**: Significantly outperforms zero-shot and naive ICRL on all tasks and models, and its performance continues to improve as the amount of data increases.
- **Approximate ICRL**: On the Llama model its performance is close to that of Exploratory ICRL, but on the Phi model more computational resources are required to achieve a similar effect.
- **Computational Efficiency**: Approximate ICRL significantly reduces computational requirements, cutting the number of tokens processed by two orders of magnitude compared to Exploratory ICRL.

### Conclusion

The paper shows that introducing randomness and filtering out negative reward signals addresses the problem of insufficient exploration in ICRL, enabling LLMs to learn effectively from reward signals alone. These methods not only improve the model's performance but also provide new directions for future research.
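
The two ingredients named in this summary, randomness in context construction and filtering of negative-reward episodes, can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the per-episode keep probability `p_keep`, the function names, and the `predict` callable (a stand-in for an LLM call) are all hypothetical.

```python
import random

# (input, prediction, reward); a hypothetical record layout
Episode = tuple[str, str, int]


def build_context(history, p_keep, rng, positive_only=True):
    """Pick a random subset of past episodes for the next prompt."""
    kept = []
    for episode in history:
        if positive_only and episode[2] <= 0:
            continue  # assumption: drop episodes without positive reward
        if rng.random() < p_keep:
            kept.append(episode)  # assumption: independent coin flip per episode
    return kept


def run_icrl(inputs, predict, reward_fn, p_keep=0.5, seed=0):
    """Process inputs one at a time, learning only from observed rewards."""
    rng = random.Random(seed)
    history = []
    for x in inputs:
        context = build_context(history, p_keep, rng)  # fresh context per input
        y = predict(context, x)   # stand-in for querying the LLM
        r = reward_fn(x, y)       # the environment returns a reward, never a gold label
        history.append((x, y, r))
    return history
```

Because a new context is sampled for every input, the model sees a different slice of its history each time, which is one plausible way to keep it from locking onto a single output. The cost is that every step builds and processes a fresh prompt, which is consistent with the summary's note that the approximate variant processes far fewer tokens.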

LLMs Are In-Context Reinforcement Learners

Large Language Models Know What Makes Exemplary Contexts

In-Context Language Learning: Architectures and Algorithms

Learning vs Retrieval: The Role of In-Context Examples in Regression with LLMs

ICLEval: Evaluating In-Context Learning Ability of Large Language Models

In-Context Learning with Reinforcement Learning for Incomplete Utterance Rewriting

Investigating the Learning Behaviour of In-Context Learning: A Comparison with Supervised Learning

Does In-Context Learning Really Learn? Rethinking How Large Language Models Respond and Solve Tasks via In-Context Learning

Decoding In-Context Learning: Neuroscience-inspired Analysis of Representations in Large Language Models

In-Context Learning Learns Label Relationships but Is Not Conventional Learning

Zero-shot Model-based Reinforcement Learning using Large Language Models

Why Larger Language Models Do In-context Learning Differently?

What Do Language Models Learn in Context? The Structured Task Hypothesis

What In-Context Learning "Learns" In-Context: Disentangling Task Recognition and Task Learning

Is In-Context Learning in Large Language Models Bayesian? A Martingale Perspective

Improving In-Context Learning with Small Language Model Ensembles

Revisiting In-Context Learning with Long Context Language Models

LLMs Are Few-Shot In-Context Low-Resource Language Learners

In-Context Explainers: Harnessing LLMs for Explaining Black Box Models

From Unstructured Data to In-Context Learning: Exploring What Tasks Can Be Learned and When

Long-context LLMs Struggle with Long In-context Learning