Abstract: Program synthesis aims to create accurate, executable programs from problem specifications, specifically from natural language descriptions in our context. Recent studies have leveraged the power of reinforcement learning (RL) in conjunction with large language models (LLMs), significantly enhancing code generation capabilities. RL directly optimizes for functional correctness, which gives it an advantage over conventional supervised methods. Although policy-based RL methods dominate the literature on RL for program synthesis, the nature of program synthesis tasks hints at a natural alignment with value-based methods. This stems from the rich collection of off-policy programs, including those developed by human programmers as well as historical samples, coupled with the straightforward verification of generated programs through automated unit testing, which makes rewards easy to obtain. Diverging from the dominant use of policy-based algorithms, our work explores the feasibility of value-based approaches, leading to the development of our $\mathcal{B}$-Coder (pronounced Bellman coder). Yet training value-based methods is challenging because of the enormous search space inherent to program synthesis. To this end, we introduce an initialization protocol for RL agents that utilizes pre-trained LMs, together with a conservative Bellman operator, to reduce training complexity. Moreover, we demonstrate how to leverage the learned value functions as a dual strategy to post-process generated programs. Our empirical evaluations demonstrate $\mathcal{B}$-Coder's capability to achieve state-of-the-art performance compared with policy-based methods. Remarkably, this is achieved with minimal reward engineering effort, highlighting the effectiveness of value-based RL, independent of reward designs.

What problem does this paper attempt to address?

The paper addresses program synthesis: automatically creating accurate, executable programs from natural language descriptions. The authors propose a value-based reinforcement learning method called $\mathcal{B}$-Coder, aimed at overcoming the limitations of existing policy-based methods on program synthesis tasks. Specifically, the paper targets three core issues:

1. **Leveraging the advantages of value functions**: Compared with policy-based methods, value-based methods can make more effective use of off-policy data, including human-written programs and historical samples. Such data are abundant in program synthesis, and reward signals for them are easy to obtain through automated unit testing. Value-based methods can therefore use these data to improve the generated programs.
2. **Overcoming training challenges**: Although value-based methods have advantages in principle, training them over the enormous state-action space of program synthesis runs into convergence difficulties. To address this, the paper proposes an initialization protocol based on pre-trained language models and a conservative Bellman operator, which together stabilize training and reduce its complexity.
3. **Minimizing reward engineering effort**: The paper also shows that strong empirical performance can be achieved with minimal reliance on reward design. Even under simple reward structures, $\mathcal{B}$-Coder performs well, which suggests that the design of the reinforcement learning algorithm matters independently of the design of the reward function.

In summary, the main contribution of the paper is $\mathcal{B}$-Coder, a new value-based method that makes effective use of abundant off-policy data, stabilizes training on large-scale program synthesis tasks, and performs well with minimal reward engineering.
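To make the point about automated unit testing concrete, here is a minimal sketch of how a reward could be attached to any program, whether human-written or a historical model sample. It is an illustration under stated assumptions, not the paper's implementation: the function `unit_test_reward`, the binary pass/fail reward, the `add` task, and the `replay_buffer` list are all hypothetical names and choices introduced here.

```python
from typing import List, Tuple

# A test case pairs positional arguments with the expected return value.
TestCase = Tuple[tuple, object]


def unit_test_reward(program_src: str, entry_point: str, tests: List[TestCase]) -> float:
    """Return 1.0 if the program passes every test, otherwise 0.0.

    The binary reward is an assumption made for this sketch; the summary
    above only says that simple reward structures suffice.
    """
    namespace: dict = {}
    try:
        # Illustration only: real systems would run untrusted code in a sandbox.
        exec(program_src, namespace)
        fn = namespace[entry_point]
        for args, expected in tests:
            if fn(*args) != expected:
                return 0.0
    except Exception:
        # Syntax errors, missing entry points, and runtime errors all count as failure.
        return 0.0
    return 1.0


# Hypothetical task: implement add(a, b).
tests = [((2, 3), 5), ((-1, 1), 0)]

human_written = "def add(a, b):\n    return a + b\n"
historical_sample = "def add(a, b):\n    return a - b\n"   # wrong logic
broken_sample = "def add(a, b) return a + b"               # syntax error

# The reward depends only on the program text and the tests, not on which
# policy produced the program, so any existing program can be labeled.
replay_buffer = [
    (src, unit_test_reward(src, "add", tests))
    for src in (human_written, historical_sample, broken_sample)
]
rewards = [reward for _, reward in replay_buffer]
```

Because the reward is computed from the program and the tests alone, the same labeling step applies to data from any source, which is what lets a value-based learner consume off-policy programs directly.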

$\mathcal{B}$-Coder: Value-Based Deep Reinforcement Learning for Program Synthesis