Outline, Then Details: Syntactically Guided Coarse-To-Fine Code Generation

Wenqing Zheng, S P Sharan, Ajay Kumar Jaiswal, Kevin Wang, Yihan Xi, Dejia Xu, Zhangyang Wang
2023-07-19
Abstract: For a complicated algorithm, its implementation by a human programmer usually starts with outlining a rough control flow followed by iterative enrichments, eventually yielding carefully generated syntactic structures and variables in a hierarchy. However, state-of-the-art large language models generate code in a single pass, without intermediate warm-ups to reflect the structured thought process of "outline-then-detail". Inspired by the recent success of chain-of-thought prompting, we propose ChainCoder, a program synthesis language model that generates Python code progressively, i.e. from coarse to fine in multiple passes. We first decompose source code into layout frame components and accessory components via abstract syntax tree parsing to construct a hierarchical representation. We then reform our prediction target into a multi-pass objective, in which each pass generates a subsequence and the subsequences are concatenated in the hierarchy. Finally, a tailored transformer architecture is leveraged to jointly encode the natural language descriptions and syntactically aligned I/O data samples. Extensive evaluations show that ChainCoder outperforms state-of-the-art methods, demonstrating that our progressive generation eases the reasoning procedure and guides the language model to generate higher-quality solutions. Our code is available at: https://github.com/VITA-Group/ChainCoder.
Programming Languages, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
The paper addresses two main issues in automatic program synthesis:

1. **Lack of intermediate steps in the generation process**: Current large language models (LLMs) typically generate code in a single autoregressive pass, regardless of the logical complexity of the task. This overlooks the process programmers usually follow, which is to first outline a rough control flow and then refine it iteratively. The paper proposes a multi-pass generation strategy that produces code from coarse to fine to improve its quality.
2. **Syntax structure is ignored**: Existing LLMs process code with tokenizers designed primarily for natural language, which do not account for the strict syntax of programming languages. The paper introduces a tokenization method based on abstract syntax trees (ASTs) that decomposes code into layout frame components and accessory components, so that the syntactic structure of the code is captured explicitly.

Specifically, the paper proposes a model named ChainCoder, which generates Python code in multiple passes, building the program structure from coarse to fine with the aim of improving the logical coherence and accuracy of the result. The paper also designs a tailored Transformer architecture that jointly encodes the natural language description and syntactically aligned input-output data samples. According to the reported evaluations, ChainCoder outperforms existing state-of-the-art models.
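To make the split between layout and accessory components more concrete, here is a minimal sketch using Python's standard `ast` module. It is not ChainCoder's tokenizer: the function name `split_layout_and_accessories`, the choice of which node types count as "layout" (`LAYOUT_NODES`), and the output format are all assumptions made for illustration. The sketch only shows the general idea of separating a program's structural skeleton from the identifiers and constants that fill it in.

```python
import ast

# Hypothetical choice of statement types treated as "layout" in this sketch.
# The paper's actual vocabulary may differ.
LAYOUT_NODES = (
    ast.FunctionDef, ast.For, ast.While, ast.If,
    ast.Assign, ast.AugAssign, ast.Return,
)


def split_layout_and_accessories(source):
    """Return (layout, accessories) for a piece of Python source.

    layout      -- list of (nesting depth, statement type name): the coarse outline
    accessories -- list of identifiers and constants in traversal order: the details
    """
    layout = []
    accessories = []

    def visit(node, depth):
        if isinstance(node, LAYOUT_NODES):
            layout.append((depth, type(node).__name__))
            depth += 1  # statements nested inside sit one level deeper
        if isinstance(node, ast.FunctionDef):
            accessories.append(node.name)    # function name
        elif isinstance(node, ast.arg):
            accessories.append(node.arg)     # parameter name
        elif isinstance(node, ast.Name):
            accessories.append(node.id)      # variable reference
        elif isinstance(node, ast.Constant):
            accessories.append(repr(node.value))  # literal value
        for child in ast.iter_child_nodes(node):
            visit(child, depth)

    visit(ast.parse(source), 0)
    return layout, accessories


SOURCE = """
def total(xs):
    s = 0
    for x in xs:
        s += x
    return s
"""

layout, accessories = split_layout_and_accessories(SOURCE)
for depth, kind in layout:
    print("    " * depth + kind)  # indented outline of the control flow
print(accessories)
```

In a coarse-to-fine setting, a model would predict something like the `layout` sequence first and the `accessories` sequence afterwards; how ChainCoder actually orders and concatenates its subsequences is described in the paper and not reproduced here.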