Abstract: Learning to Map Natural Language to General Purpose Source Code
Srinivasan Iyer
Co-Chairs of the Supervisory Committee: Associate Professor Luke Zettlemoyer, Assistant Professor Alvin Cheung
Computer Science and Engineering

Models that automatically map natural language (NL) to source code in general purpose languages such as Java, Python, and SQL are useful to two main audiences: developers and non-expert users. For developers, they enable use cases such as serving as an NL assistant in programming IDEs, verifying the consistency of code documentation with code changes, and answering "how to" questions for developers using new languages. For non-expert users, they make it possible to communicate with databases, devices, and applications, or to visualize data, without having to learn to write computer programs. Developing these models is challenging because of contextual dependencies of the target code, the lack of alignment between NL and code tokens, syntactic and semantic requirements of the target code, and the prohibitive cost of annotating training data. Furthermore, whilst developers can see and manipulate the generated code, non-expert users only see the output of execution, and therefore require the generated code to be exactly correct and executable. Finally, for users to trust models that automatically produce code, particularly in high-cost scenarios, it is important for models to provide an explanation of the generated code back to the user. This dissertation presents tasks, training methods and resources, and new models for mapping NL to source code for both developers and non-expert users, and is divided into four parts. In the first part, we formalize the task of contextual code generation from NL for developers.
We present ways to obtain inexpensive training datasets from large online code repositories, followed by methods to incorporate contextual awareness into syntax-guided neural models to improve performance on the task. The second part shifts focus from developers to non-expert users, where we present methods to build NL interfaces that allow non-expert users to query databases by automatically mapping their NL requests to database SQL queries. Our methods are geared towards building deep learning models that improve in performance over time by leveraging user feedback and annotations obtained from crowd programmers, and open up inexpensive ways to build accurate NL interfaces for arbitrary database schemas. The third part of this dissertation presents the use of programmatic idioms as a means to significantly reduce training time and improve performance on both the NL-to-code tasks of parts 1 and 2. We discuss algorithms to extract frequently used programmatic idioms and train neural models to learn to apply them during code generation. Finally, we present models that describe the functionality of source code to users in NL as a first step towards building trustworthy language-to-code systems. Overall, this dissertation presents efficient deep learning models and training paradigms to map language to general purpose source code that will enable numerous applications for non-expert users as well as developers.

Coverage of Course Topics in Learnersourced SQL Exercises

Learning SQL from within: integrating database exercises into the database itself

Understanding Help-Seeking Behavior of Students Using LLMs vs. Web Search for Writing SQL Queries

ItsSQL: Intelligent Tutoring System for SQL

Teaching and Learning Programming and Software Engineering Via Interactive Gaming

The Development of an Automatic Test Assembly System for a Formative Assessment in Mastery Learning Instruction: Case of the SQL Mastery Course

Generating program inputs for database application testing

Leveraging Prior Experience: An Expandable Auxiliary Knowledge Base for Text-to-SQL

Automated Questions About Learners' Own Code Help to Detect Fragile Knowledge

SQLRepair: Identifying and Repairing Mistakes in Student-Authored SQL Queries

Synthesis of SQL Queries from South African Local Language Narrations

Database state generation via dynamic symbolic execution for coverage criteria

SQL Query Completion for Data Exploration

TCSR-SQL: Towards Table Content-aware Text-to-SQL with Self-retrieval

PROGpedia: Collection of source-code submitted to introductory programming assignments

Students Struggle to Explain Their Own Program Code

Learnersourcing in the Age of AI: Student, Educator and Machine Partnerships for Content Creation

Aspects on Finding the Optimal Practical Programming Exercise for MOOCs

Automated Generation of Computer Graded Unit Testing-Based Programming Assessments for Education