Efficient Exploration for LLMs

Vikranth Dwaracherla, Seyed Mohammad Asghari, Botao Hao, Benjamin Van Roy
2024-06-05
Abstract: We present evidence of substantial benefit from efficient exploration in gathering human feedback to improve large language models. In our experiments, an agent sequentially generates queries while fitting a reward model to the feedback received. Our best-performing agent generates queries using double Thompson sampling, with uncertainty represented by an epistemic neural network. Our results demonstrate that efficient exploration enables high levels of performance with far fewer queries. Further, both uncertainty estimation and the choice of exploration scheme play critical roles.
Machine Learning, Artificial Intelligence, Computation and Language, Methodology
What problem does this paper attempt to address?
This paper asks how large language models (LLMs) can learn from human feedback more efficiently. Specifically, it studies whether actively choosing which queries to send for human feedback reduces the number of queries needed to reach a given level of performance. The authors experimentally compare several exploration algorithms: passive exploration, Boltzmann exploration, information maximization (infomax), and double Thompson sampling.

### Main problems

1. **Improving learning efficiency**: How can effective exploration reduce the amount of data a large language model needs to learn from human feedback?
2. **Optimizing exploration strategies**: Which exploration strategies accelerate learning the most and reach the highest performance level?
3. **Evaluating the quality of uncertainty estimation**: How can uncertainty estimates be evaluated and used to improve exploration?

### Experimental design

- **Dataset**: The Anthropic Helpfulness Base training and evaluation dataset.
- **Model**: Based on the pre-trained Gemini Nano and Gemini Pro language models.
- **Exploration algorithms**:
  - Passive exploration: Select response pairs at random.
  - Boltzmann exploration: Select response pairs based on the reward model's point estimate.
  - Information maximization (infomax): Use uncertainty estimates to select the most informative response pairs.
  - Double Thompson sampling: Use uncertainty estimates to select pairs of potentially optimal responses.
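
The four selection schemes listed above can be illustrated with a small sketch. This is not the paper's implementation; it makes several assumptions. Epistemic uncertainty is represented by a plain list of reward vectors (one per hypothetical ensemble member) rather than an epistemic neural network. The probability that one response is preferred over another is assumed to be a logistic function of the reward difference. The function names and the `temperature` and `max_tries` parameters are invented for illustration.

```python
import math
import random

# rewards[m][i] is the reward that (hypothetical) ensemble member m assigns to
# candidate response i. The ensemble is a stand-in for any uncertainty model.

def point_estimate(rewards):
    """Mean reward of each response across ensemble members."""
    n = len(rewards[0])
    return [sum(member[i] for member in rewards) / len(rewards) for i in range(n)]

def passive(rewards, rng):
    """Passive exploration: pick two distinct responses uniformly at random."""
    return tuple(rng.sample(range(len(rewards[0])), 2))

def boltzmann(rewards, rng, temperature=1.0):
    """Boltzmann exploration: sample two distinct responses with probability
    proportional to exp(point estimate / temperature)."""
    means = point_estimate(rewards)
    top = max(means)  # subtract the max for numerical stability
    weights = [math.exp((m - top) / temperature) for m in means]
    first = rng.choices(range(len(means)), weights=weights)[0]
    rest = [i for i in range(len(means)) if i != first]
    second = rng.choices(rest, weights=[weights[i] for i in rest])[0]
    return first, second

def infomax(rewards):
    """Infomax (assumed form): pick the pair whose preference probability
    varies most across ensemble members."""
    n = len(rewards[0])
    best_pair, best_var = None, -1.0
    for i in range(n):
        for j in range(i + 1, n):
            # Assumed logistic preference model on the reward difference.
            probs = [1.0 / (1.0 + math.exp(-(m[i] - m[j]))) for m in rewards]
            mean = sum(probs) / len(probs)
            var = sum((p - mean) ** 2 for p in probs) / len(probs)
            if var > best_var:
                best_pair, best_var = (i, j), var
    return best_pair

def double_thompson(rewards, rng, max_tries=10):
    """Double Thompson sampling (assumed form): each response is the best one
    under an independently sampled ensemble member."""
    n = len(rewards[0])

    def sample_best():
        member = rng.choice(rewards)
        return max(range(n), key=lambda i: member[i])

    first = sample_best()
    for _ in range(max_tries):
        second = sample_best()
        if second != first:
            return first, second
    # Fall back to a random different response if resampling keeps agreeing.
    return first, rng.choice([i for i in range(n) if i != first])

# Toy data: three responses, three ensemble members that disagree on response 1.
toy_rewards = [[0.0, 2.0, 1.0], [0.0, -2.0, 1.0], [0.0, 2.0, 1.0]]
rng = random.Random(0)
print(passive(toy_rewards, rng), boltzmann(toy_rewards, rng),
      infomax(toy_rewards), double_thompson(toy_rewards, rng))
```

On this toy data, infomax selects the pair (0, 1), because the ensemble members disagree most about whether response 1 beats response 0. Double Thompson sampling always starts from response 1 or 2, the only responses that any member ranks highest. Passive and Boltzmann exploration ignore the disagreement between members entirely.
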
### Experimental results

- **Performance improvement**: Double Thompson sampling (double TS) performs best among the exploration strategies tested and reaches a high performance level with relatively few queries.
- **Uncertainty estimation**: Double Thompson sampling significantly improves learning efficiency by making use of uncertainty estimates.
- **Reduction in data requirements**: Compared with passive exploration, double Thompson sampling can reduce data requirements by an order of magnitude.

### Conclusion

The paper shows experimentally that efficient exploration strategies, especially double Thompson sampling, can significantly reduce the number of queries a large language model needs to learn from human feedback, which accelerates learning and leads to a higher performance level. This provides a useful reference for future research, particularly on handling large-scale data and improving model performance.