Learning to Map Natural Language to General Purpose Source Code
Srini Iyer
Srinivasan Iyer. Co-Chairs of the Supervisory Committee: Associate Professor Luke Zettlemoyer and Assistant Professor Alvin Cheung, Computer Science and Engineering.
Abstract: Models that automatically map natural language (NL) to source code in general purpose languages such as Java, Python, and SQL are useful to two main audiences: developers and non-expert users. For developers, they enable use cases such as serving as an NL assistant in programming IDEs, verifying that code documentation stays consistent with code changes, and answering "how to" questions for developers working in new languages. For non-expert users, they make it possible to communicate with databases, devices, and applications, or to visualize data, without having to learn to write computer programs. Developing these models is challenging because of the contextual dependencies of the target code, the lack of alignment between NL and code tokens, the syntactic and semantic requirements of the target code, and the prohibitive cost of annotating training data. Furthermore, while developers can see and manipulate the generated code, non-expert users see only the output of its execution, so for them the generated code must additionally be exactly correct and executable. Finally, for users to trust models that automatically produce code, particularly in high-cost scenarios, it is important for models to explain the generated code back to the user. This dissertation presents tasks, training methods and resources, and new models for mapping NL to source code for both developers and non-expert users, and is divided into four parts. In the first part, we formalize the task of contextual code generation from NL for developers.
We present ways to obtain inexpensive training datasets from large online code repositories, followed by methods to incorporate contextual awareness into syntax-guided neural models to improve performance on the task. The second part shifts focus from developers to non-expert users: we present methods to build NL interfaces that allow non-expert users to query databases by automatically mapping their NL requests to SQL queries. Our methods are geared towards building deep learning models that improve over time by leveraging user feedback and annotations obtained from crowd programmers, and they open up inexpensive ways to build accurate NL interfaces for arbitrary database schemas. The third part of this dissertation presents the use of programmatic idioms as a means to significantly reduce training time and improve performance on both of the NL-to-code tasks of parts 1 and 2. We discuss algorithms to extract frequently used programmatic idioms and train neural models to learn to apply them during code generation. Finally, in the fourth part, we present models that describe the functionality of source code to users in NL, as a first step towards building trustworthy language-to-code systems. Overall, this dissertation presents efficient deep learning models and training paradigms for mapping language to general purpose source code, which will enable numerous applications for non-expert users as well as developers.
Computer Science