AI Coders Are Among Us: Rethinking Programming Language Grammar Towards Efficient Code Generation

Zhensu Sun, Xiaoning Du, Zhou Yang, Li Li, David Lo
2024-08-14
Abstract: Artificial Intelligence (AI) models have emerged as another important audience for programming languages alongside humans and machines, as we enter the era of large language models (LLMs). LLMs can now perform well in coding competitions and even write programs like developers to solve various tasks, including mathematical problems. However, the grammar and layout of current programs are designed to cater to the needs of human developers, with many grammar tokens and formatting tokens being used to make the code easier for humans to read. While this is helpful, such a design adds unnecessary computational work for LLMs, as each token they either use or produce consumes computational resources. To improve inference efficiency and reduce computational costs, we propose the concept of AI-oriented grammar. This aims to represent code in a way that better suits the working mechanism of AI models. Code written with AI-oriented grammar discards formats and uses a minimum number of tokens to convey code semantics effectively. To demonstrate the feasibility of this concept, we explore and implement the first AI-oriented grammar for Python, named SimPy. SimPy is crafted by revising the original Python grammar through a series of heuristic rules. Programs written in SimPy maintain identical AST structures to those in standard Python. This allows for not only execution via a modified AST parser, but also seamless transformation between programs written in Python and SimPy, enabling human developers and LLMs to use Python and SimPy, respectively, when they need to collaborate. In the experiments, compared with Python, SimPy enables a reduction in token usage by 13.5% and 10.4% for CodeLlama and GPT-4, respectively, when completing the same set of code-related tasks. Additionally, these models can maintain or even improve their performance when using SimPy instead of Python for these tasks.
Software Engineering, Artificial Intelligence, Programming Languages
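The abstract's two central ideas, that formatting costs tokens and that a program's AST can stay identical while its surface form shrinks, can be illustrated with standard Python alone. The sketch below is not SimPy: it only compares a readable layout with a compressed one that is still valid Python, and it counts tokens from Python's own lexer (`tokenize`), not from an LLM tokenizer such as those used by CodeLlama or GPT-4. The names `READABLE`, `COMPACT`, `same_ast`, and `count_lexer_tokens` are hypothetical, chosen for illustration.

```python
import ast
import io
import tokenize

# The same function twice: laid out for human readers, and compressed
# as far as standard Python grammar allows (this is NOT SimPy).
READABLE = '''
def add(a, b):
    # Return the sum of two numbers.
    total = a + b
    return total
'''

COMPACT = "def add(a,b):total=a+b;return total"

# Lexer token types that carry layout or commentary rather than semantics.
FORMATTING_TYPES = {
    tokenize.NEWLINE,
    tokenize.NL,
    tokenize.INDENT,
    tokenize.DEDENT,
    tokenize.COMMENT,
}


def same_ast(src_a: str, src_b: str) -> bool:
    """True if both sources parse to structurally identical ASTs.

    ast.dump omits line and column attributes by default, so only the
    tree structure is compared.
    """
    return ast.dump(ast.parse(src_a)) == ast.dump(ast.parse(src_b))


def count_lexer_tokens(src: str) -> tuple[int, int]:
    """Return (total tokens, formatting tokens) from Python's lexer."""
    tokens = [
        tok
        for tok in tokenize.generate_tokens(io.StringIO(src).readline)
        if tok.type != tokenize.ENDMARKER
    ]
    formatting = sum(1 for tok in tokens if tok.type in FORMATTING_TYPES)
    return len(tokens), formatting


readable_total, readable_formatting = count_lexer_tokens(READABLE)
compact_total, compact_formatting = count_lexer_tokens(COMPACT)
```

Both spellings parse to the same tree, yet the readable one carries more layout tokens. The paper's reported reductions (13.5% and 10.4%) are measured with the models' own tokenizers, so the counts here indicate only the direction of the effect, not its size.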
What problem does this paper attempt to address?
The paper asks how programming language grammar could be redesigned to make code generation by AI models, particularly large language models (LLMs), more efficient.

### Research Background and Objectives

LLMs can now take part in coding competitions and write programs to solve various tasks, much as human developers do. Existing programming language grammar, however, is designed mainly for human developers: it contains many symbols and formatting elements that improve readability, and each of them adds computational cost for an LLM, because every token the model consumes or produces uses resources.

### Main Contributions

1. **Proposing the Concept of AI-Oriented Grammar**: The researchers propose AI-oriented grammar, a grammar designed to suit how LLMs work. It minimizes the number of tokens and keeps only the parts needed to express code semantics.
2. **Implementing the First AI-Oriented Python Grammar, Simple Python (SimPy)**: The researchers derived SimPy from the original Python grammar through a series of heuristic rules. Programs written in SimPy have the same abstract syntax tree (AST) structure as their standard Python counterparts, so they can be parsed and executed, and converted to and from standard Python.
3. **Experimental Validation**: Compared with standard Python, SimPy reduced token usage by 13.5% for CodeLlama and 10.4% for GPT-4 on the same set of code-related tasks. The models maintained or even improved their task performance when using SimPy.
4. **Support for Practical Application Scenarios**: To widen the applicability of AI-oriented grammar, the researchers also propose a code generation framework called DualCode. Through a rule-based converter, it lets users keep working with human-readable code while the model benefits from the efficiency of AI-oriented grammar.

### Conclusion

The paper proposes a new grammar design concept, AI-oriented grammar, and demonstrates its feasibility and potential value by implementing Simple Python. This offers a new direction for programming language design that takes the efficiency of AI models into account.
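The DualCode idea, a converter sitting on both sides of the model, can be sketched as a simple round trip. This is a minimal illustration under stated assumptions, not the authors' implementation: `to_ai_form`, `to_human_form`, `dual_code_complete`, and `fake_model` are hypothetical names, and the placeholder converters only normalize code through Python's `ast` module (dropping comments and blank lines) where a real system would apply the SimPy rules.

```python
import ast
from typing import Callable

# Requires Python 3.9+ for ast.unparse.


def to_ai_form(python_src: str) -> str:
    """Placeholder for the Python -> AI-oriented-grammar converter.

    Assumption: a real converter would emit SimPy. This stand-in only
    drops comments and blank lines by round-tripping through the AST.
    """
    return ast.unparse(ast.parse(python_src))


def to_human_form(ai_src: str) -> str:
    """Placeholder for the AI-oriented-grammar -> Python converter."""
    return ast.unparse(ast.parse(ai_src))


def dual_code_complete(human_prompt: str, model: Callable[[str], str]) -> str:
    """Convert the prompt, let the model work in the compact form,
    then convert the model's output back for the human reader."""
    ai_prompt = to_ai_form(human_prompt)
    ai_completion = model(ai_prompt)
    return to_human_form(ai_completion)


def fake_model(ai_prompt: str) -> str:
    """Hypothetical stand-in for an LLM: appends one fixed statement."""
    return ai_prompt + "\nprint(square(3))"


prompt = "def square(x):\n    # Multiply x by itself.\n\n    return x * x\n"
result = dual_code_complete(prompt, fake_model)
```

Because both directions are rule-based and preserve the AST, the user never has to read or write the compact form; only the model's input and output pass through it.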