Abstract

Learning to Map Natural Language to General Purpose Source Code
Srinivasan Iyer
Co-Chairs of the Supervisory Committee: Associate Professor Luke Zettlemoyer, Assistant Professor Alvin Cheung
Computer Science and Engineering

Models that automatically map natural language (NL) to source code in general purpose languages such as Java, Python, and SQL are useful to two main audiences: developers and non-expert users. For developers, they enable use cases such as serving as an NL assistant in programming IDEs, verifying the consistency of code documentation with code changes, and answering "how to" questions for developers using new languages. For non-expert users, they make it possible to communicate with databases, devices, and applications, or to visualize data, without having to learn to write computer programs.

Developing these models is challenging because of contextual dependencies of the target code, the lack of alignment between NL and code tokens, syntactic and semantic requirements of the target code, and the prohibitive cost of annotating training data. Furthermore, whilst developers can see and manipulate the generated code, non-expert users see only the output of execution, which adds the constraint that the generated code must be exactly correct and executable. Finally, for users to trust models that automatically produce code, particularly in high-cost scenarios, it is important for models to explain the generated code back to the user.

This dissertation presents tasks, training methods and resources, and new models for mapping NL to source code for both developers and non-expert users, and is divided into four parts. In the first part, we formalize the task of contextual code generation from NL for developers. We present ways to obtain inexpensive training datasets from large online code repositories, followed by methods to incorporate contextual awareness into syntax-guided neural models to improve performance on the task.

The second part shifts focus from developers to non-expert users, where we present methods to build NL interfaces that allow non-expert users to query databases by automatically mapping their NL requests to SQL queries. Our methods are geared towards building deep learning models that improve in performance over time by leveraging user feedback and annotations obtained from crowd programmers, and open up inexpensive ways to build accurate NL interfaces for arbitrary database schemas.

The third part of this dissertation presents the use of programmatic idioms as a means to significantly reduce training time and to improve performance on both of the NL-to-code tasks of parts 1 and 2. We discuss algorithms to extract frequently used programmatic idioms and train neural models to learn to apply them during code generation.

Finally, in the fourth part, we present models that describe the functionality of source code to users in NL as a first step towards building trustworthy language-to-code systems. Overall, this dissertation presents efficient deep learning models and training paradigms to map language to general purpose source code that will enable numerous applications for non-expert users as well as developers.
