Abstract: We present evidence of substantial benefit from efficient exploration in gathering human feedback to improve large language models. In our experiments, an agent sequentially generates queries while fitting a reward model to the feedback received. Our best-performing agent generates queries using double Thompson sampling, with uncertainty represented by an epistemic neural network. Our results demonstrate that efficient exploration enables high levels of performance with far fewer queries. Further, both uncertainty estimation and the choice of exploration scheme play critical roles.

What problem does this paper attempt to address?

This paper addresses how large language models (LLMs) can learn from human feedback more efficiently. Specifically, it asks whether actively choosing which queries to send for human feedback can reduce the number of queries needed to reach a given level of performance. The authors compare several exploration algorithms experimentally: passive exploration, Boltzmann exploration, information maximization (infomax), and double Thompson sampling.

### Main problems

1. **Improving learning efficiency**: How can effective exploration reduce the amount of human feedback data an LLM needs?
2. **Optimizing exploration strategies**: Which exploration strategies accelerate learning the most and reach the highest performance?
3. **Evaluating the quality of uncertainty estimation**: How can uncertainty estimates be evaluated and used to improve exploration?

### Experimental design

- **Dataset**: The Anthropic Helpfulness Base training and evaluation dataset.
- **Model**: Based on the Gemini Nano and Gemini Pro pre-trained language models.
- **Exploration algorithms**:
  - Passive exploration: Select response pairs at random.
  - Boltzmann exploration: Select response pairs based on the point estimate of the reward model.
  - Information maximization (infomax): Use uncertainty estimates to select the most informative response pairs.
  - Double Thompson sampling: Use uncertainty estimates to select response pairs that are plausibly optimal.

### Experimental results

- **Performance improvement**: Double Thompson sampling (double TS) performs best among the exploration strategies tested and reaches a high level of performance with relatively few queries.
- **Uncertainty estimation**: Double TS owes its gain in learning efficiency to its use of uncertainty estimates.
- **Reduction in data requirements**: Compared with passive exploration, double TS can reduce data requirements by an order of magnitude.

### Conclusion

The experiments show that efficient exploration strategies, double Thompson sampling in particular, can significantly reduce the number of queries required for LLMs to learn from human feedback, so that a given level of performance is reached sooner. This provides a useful reference for future research on collecting feedback data at scale and improving model performance.
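The two uncertainty-based selection rules summarized above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that uncertainty is represented by a small ensemble of reward estimates (`ensemble_rewards[m][n]` is the reward of candidate response `n` under ensemble member `m`), that preferences follow a Bradley-Terry style sigmoid of the reward difference, and that infomax scores a pair by the variance of that preference probability across the ensemble. The function names and the `max_tries` parameter are hypothetical.

```python
import itertools
import math
import random
import statistics


def double_ts_pair(ensemble_rewards, rng, max_tries=30):
    """Double Thompson sampling sketch: pick two distinct responses,
    each the best one under an independently sampled ensemble member."""

    def best_under_random_member():
        member = rng.choice(ensemble_rewards)  # one posterior sample (assumption: ensemble)
        return max(range(len(member)), key=member.__getitem__)

    first = best_under_random_member()
    for _ in range(max_tries):
        second = best_under_random_member()
        if second != first:
            return first, second
    # All samples agreed on the best response: fall back to a random opponent.
    others = [n for n in range(len(ensemble_rewards[0])) if n != first]
    return first, rng.choice(others)


def infomax_pair(ensemble_rewards):
    """Infomax sketch: pick the pair whose preference probability
    varies most across ensemble members (assumed scoring rule)."""
    num_responses = len(ensemble_rewards[0])
    best_pair, best_variance = None, -1.0
    for i, j in itertools.combinations(range(num_responses), 2):
        # Assumed Bradley-Terry style preference: sigmoid of the reward gap.
        probs = [1.0 / (1.0 + math.exp(-(m[i] - m[j]))) for m in ensemble_rewards]
        variance = statistics.pvariance(probs)
        if variance > best_variance:
            best_pair, best_variance = (i, j), variance
    return best_pair


if __name__ == "__main__":
    rng = random.Random(0)
    # Two ensemble members that disagree about which of three responses is best.
    rewards = [[1.0, 0.0, 0.2], [0.0, 1.0, 0.2]]
    print("double TS pair:", double_ts_pair(rewards, rng))
    print("infomax pair:", infomax_pair(rewards))
```

In this toy example the two rules tend to query the pair the ensemble members disagree about, whereas passive exploration would pick any pair with equal probability regardless of what the reward model has already learned.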

Efficient Exploration for LLMs

EVOLvE: Evaluating and Optimizing LLMs For Exploration

Query Expansion by Prompting Large Language Models

Fine-grained LLM Agent: Pinpointing and Refining Large Language Models via Fine-Grained Actionable Feedback

Can large language models explore in-context?

Choices are More Important than Efforts: LLM Enables Efficient Multi-Agent Exploration

Self-Evolutionary Large Language Models through Uncertainty-Enhanced Preference Optimization

ExpeL: LLM Agents Are Experiential Learners

Introspective Tips: Large Language Model for In-Context Decision Making

Efficient Reinforcement Learning with Large Language Model Priors

Probing the Multi-turn Planning Capabilities of LLMs via 20 Question Games

Look Before You Leap: An Exploratory Study of Uncertainty Measurement for Large Language Models

Leveraging Large Language Models for Exploiting ASR Uncertainty

Query-Efficient Planning with Language Models

LLaMA Rider: Spurring Large Language Models to Explore the Open World

Efficient Sequential Decision Making with Large Language Models

AgentBench: Evaluating LLMs as Agents

Large Language Models As Evolution Strategies

Leveraging Large Language Models for Tradespace Exploration

Supervised Knowledge Makes Large Language Models Better In-context Learners