Outline, Then Details: Syntactically Guided Coarse-To-Fine Code Generation

Wenqing Zheng, S P Sharan, Ajay Kumar Jaiswal, Kevin Wang, Yihan Xi, Dejia Xu, Zhangyang Wang

2023-07-19

Abstract: For a complicated algorithm, its implementation by a human programmer usually starts with outlining a rough control flow followed by iterative enrichments, eventually yielding carefully generated syntactic structures and variables in a hierarchy. However, state-of-the-art large language models generate code in a single pass, without intermediate warm-ups to reflect the structured thought process of "outline-then-detail". Inspired by the recent success of chain-of-thought prompting, we propose ChainCoder, a program synthesis language model that generates Python code progressively, i.e., from coarse to fine in multiple passes. We first decompose source code into layout frame components and accessory components via abstract syntax tree parsing to construct a hierarchical representation. We then reform our prediction target into a multi-pass objective in which each pass generates a subsequence, and the subsequences are concatenated in the hierarchy. Finally, a tailored transformer architecture is leveraged to jointly encode the natural language descriptions and syntactically aligned I/O data samples. Extensive evaluations show that ChainCoder outperforms state-of-the-art methods, demonstrating that our progressive generation eases the reasoning procedure and guides the language model to generate higher-quality solutions. Our code is available at https://github.com/VITA-Group/ChainCoder.

Programming Languages, Artificial Intelligence, Machine Learning

What problem does this paper attempt to address?

The paper aims to address two main issues in automatic program synthesis:

1. **Lack of intermediate steps in the generation process**: Current large language models (LLMs) typically generate code in a single autoregressive pass, regardless of the logical complexity of the task. This overlooks the process programmers usually follow, which is to first construct a rough control flow framework and then gradually refine it. The paper proposes a multi-stage reasoning strategy that improves code generation quality by refining the output progressively.
2. **Syntax structure is ignored**: Existing LLMs process code with tokenizers designed primarily for natural language, which do not fully account for the strict syntax structures unique to programming languages. The paper introduces a tokenization method based on abstract syntax trees (AST) that breaks code down into layout frameworks and auxiliary components to better capture its syntax information.

Specifically, the paper proposes a new model named ChainCoder, which generates code step by step, constructing the code structure from coarse to fine to improve the logical coherence and accuracy of the generated code. The paper also designs a specialized Transformer architecture that jointly encodes the natural language descriptions and the syntactically aligned input-output data samples, further enhancing the model's expressiveness. Experimental results show that ChainCoder outperforms existing advanced models on multiple benchmarks.
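The split into a coarse layout and fine-grained accessory components can be illustrated with Python's standard `ast` module. This is only a minimal sketch of the general idea, not ChainCoder's actual tokenizer: the function name `split_layout_and_accessories`, the choice of statement kinds as "layout" and of identifiers and literals as "accessories", and the sample program are all assumptions made for illustration.

```python
import ast


def split_layout_and_accessories(source: str):
    """Split Python source into a coarse outline and its fine-grained details.

    Hypothetical helper for illustration; the paper's real decomposition
    is more elaborate than this.
    """
    tree = ast.parse(source)
    layout = []       # (nesting depth, statement kind): the coarse outline
    accessories = []  # identifiers and literals: the details filled in later

    def visit(node, depth):
        if isinstance(node, ast.stmt):
            layout.append((depth, type(node).__name__))
            # Function and class names are plain strings on the node itself.
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                accessories.append(node.name)
            depth += 1
        elif isinstance(node, ast.Name):
            accessories.append(node.id)
        elif isinstance(node, ast.arg):
            accessories.append(node.arg)
        elif isinstance(node, ast.Constant):
            accessories.append(repr(node.value))
        for child in ast.iter_child_nodes(node):
            visit(child, depth)

    visit(tree, 0)
    return layout, accessories


# Sample program chosen for illustration only.
example = """
def total(xs):
    s = 0
    for x in xs:
        if x > 0:
            s += x
    return s
"""

layout, accessories = split_layout_and_accessories(example)

# Print the outline first, then the details, mirroring "outline, then details".
for depth, kind in layout:
    print("    " * depth + kind)
print(accessories)
```

In this sketch the outline (`FunctionDef`, `Assign`, `For`, `If`, `AugAssign`, `Return`) carries the control flow, while the accessory list holds the names and constants that a later, finer pass would fill in.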

CodeChain: Towards Modular Code Generation Through Chain of Self-revisions with Representative Sub-modules

CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis

Chain of Code: Reasoning with a Language Model-Augmented Code Emulator

StructCoder: Structure-Aware Transformer for Code Generation

AI Chain on Large Language Model for Unsupervised Control Flow Graph Generation for Statically-Typed Partial Code

Bridging Code Semantic and LLMs: Semantic Chain-of-Thought Prompting for Code Generation

PanGu-Coder: Program Synthesis with Function-Level Language Modeling

Demo2Code: From Summarizing Demonstrations to Synthesizing Code via Extended Chain-of-Thought

Planning with Large Language Models for Code Generation

StepCoder: Improve Code Generation with Reinforcement Learning from Compiler Feedback

UniCoder: Scaling Code Large Language Model via Universal Code

CodeGRAG: Bridging the Gap between Natural Language and Programming Language via Graphical Retrieval Augmented Generation

Enabling Programming Thinking in Large Language Models Toward Code Generation

Fine-grained Pseudo-code Generation Method via Code Feature Extraction and Transformer

Execution-based Code Generation using Deep Reinforcement Learning

Compilable Neural Code Generation with Compiler Feedback

Think Outside the Code: Brainstorming Boosts Large Language Models in Code Generation

JumpCoder: Go Beyond Autoregressive Coder via Online Modification