On the Effectiveness of Large Language Models in Domain-Specific Code Generation

Xiaodong Gu, Meng Chen, Yalan Lin, Yuhan Hu, Hongyu Zhang, Chengcheng Wan, Zhao Wei, Yong Xu, Juhong Wang
2024-09-10
Abstract: Large language models (LLMs) such as ChatGPT have shown remarkable capabilities in code generation. Despite significant achievements, they rely on enormous training data to acquire a broad spectrum of open-domain knowledge. Besides, their evaluation revolves around open-domain benchmarks like HumanEval, which primarily consist of programming contests. Therefore, it is hard to fully characterize the intricacies and challenges associated with particular domains (e.g., web, game, and math). In this paper, we conduct an in-depth study of the LLMs in domain-specific code generation. Our results demonstrate that LLMs exhibit sub-optimal performance in generating domain-specific code, due to their limited proficiency in utilizing domain-specific libraries. We further observe that incorporating API knowledge as prompts can empower LLMs to generate more professional code. Based on these findings, we further investigate how to effectively incorporate API knowledge into the code generation process. We experiment with three strategies for incorporating domain knowledge, namely, external knowledge inquirer, chain-of-thought prompting, and chain-of-thought fine-tuning. We refer to these strategies as a new code generation approach called DomCoder. Experimental results show that all strategies of DomCoder lead to improvement in the effectiveness of domain-specific code generation under certain settings.
Software Engineering
What problem does this paper attempt to address?
The paper examines how well large language models (LLMs) generate code in specific domains and where they fall short. It addresses three questions:

1. **Effectiveness of LLMs in domain-specific code generation**: LLMs perform well on general-domain code generation tasks, but how effective are they in specific domains? Such domains include web development and game development, which usually require specific frameworks and libraries (for HTTP, RPC, Unreal Engine, and so on). The paper evaluates LLMs in these domains experimentally and finds a significant performance degradation.
2. **How to effectively prompt LLMs to generate domain-specific code**: Domain-specific code generation requires an in-depth understanding of particular libraries and APIs, so can effective prompts improve LLM performance on these tasks? The paper designs several knowledge-based prompting methods and studies their impact on domain-specific code generation.
3. **How to enhance code generation models in specific domains**: Beyond prompting, can other means, such as fine-tuning, further improve domain-specific code generation? The paper proposes a new approach, DomCoder, which integrates domain knowledge into the code generation process through three strategies: an external knowledge inquirer, chain-of-thought prompting, and chain-of-thought fine-tuning.

### Main contributions of the paper

1. **Empirical research**: Conducts an empirical study of LLMs' ability to generate domain-specific code, revealing their performance and limitations.
2. **Effectiveness of prompting methods**: Studies how different types of prompts affect domain-specific code generation and proposes a knowledge-enhanced prompting method.
3. **New integration method**: Proposes DomCoder, which integrates domain knowledge into the LLM code generation process through multiple strategies and improves domain-specific code generation under certain settings.
4. **In-depth discussion**: Analyzes the research results in depth and discusses future research directions.

### Experimental setup

To answer the research questions above, the paper constructs a domain-specific code data set covering two domains (web development and game development) and two programming languages (Go and C++). The functions in the data set come from public GitHub repositories and are filtered and extracted according to their use of six domain-specific libraries (Gin, Prometheus, gRPC-go, Unreal Engine, Cocos2d-x, Bgfx).

### Experimental results

1. **Quantitative results**: LLMs perform worse on domain-specific code generation tasks than on general-domain ones. For example, ChatGPT's BLEU score drops by an average of 70.35% in specific domains, and its CodeBLEU score drops by an average of 51.48%.
2. **Qualitative results**: Manual inspection of the code generated by ChatGPT shows that common errors include API misuse and missing API calls. These errors indicate that LLMs lack an in-depth understanding of the specific libraries and APIs involved.
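
The "missing API calls" error category from the qualitative results can be illustrated with a rough automated check. The sketch below is not the paper's procedure (the paper reports manual inspection); it is a crude textual heuristic, and the function names `called_names` and `missing_api_calls`, the two Go snippets, and the small set of Gin method names are all hypothetical choices made for illustration.

```python
import re
from typing import Set


def called_names(code: str) -> Set[str]:
    """Return identifiers that appear directly before '(' in the code text.

    This is a textual approximation of "functions or methods that are
    called"; it also picks up function definitions and does no parsing.
    """
    return set(re.findall(r"\b([A-Za-z_]\w*)\s*\(", code))


def missing_api_calls(generated: str, reference: str, domain_apis: Set[str]) -> Set[str]:
    """Domain APIs used by the reference code but absent from the generated code."""
    expected = called_names(reference) & domain_apis
    return expected - called_names(generated)


# Hypothetical reference implementation using Gin's Context.JSON method.
REFERENCE = """
func ping(c *gin.Context) {
    c.JSON(200, gin.H{"message": "pong"})
}
"""

# Hypothetical model output that bypasses the library API.
GENERATED = """
func ping(c *gin.Context) {
    fmt.Fprintf(c.Writer, "pong")
}
"""

# Assumed subset of Gin Context method names, chosen for this example only.
GIN_APIS = {"JSON", "String", "Bind"}

if __name__ == "__main__":
    print(missing_api_calls(GENERATED, REFERENCE, GIN_APIS))
```

A check like this can only flag candidates for review. Deciding whether an API was misused, as opposed to simply omitted, still requires reading the code.
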
### Conclusion

Through an empirical study and a new method, the paper reveals the limitations of LLMs in domain-specific code generation and proposes DomCoder, which improves LLM performance on these tasks under certain settings. The results are relevant to applying LLMs in specific domains.
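
The idea of incorporating API knowledge into the prompt, which the summary describes only in general terms, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's prompt template: the `ApiDoc` type, the `build_prompt` function, and the prompt wording are hypothetical, and the two Gin entries are written by hand for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ApiDoc:
    """One piece of API knowledge: a name and a one-line description."""
    name: str
    description: str


def build_prompt(task: str, apis: List[ApiDoc]) -> str:
    """Prepend a list of relevant library APIs to a code generation task.

    The wording is an assumption made for illustration; the source does
    not specify the actual prompt format.
    """
    lines = ["You may use the following library APIs:"]
    for api in apis:
        lines.append(f"- {api.name}: {api.description}")
    lines.append("")
    lines.append(f"Task: {task}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hand-written descriptions of two Gin APIs, used only as sample input.
    knowledge = [
        ApiDoc("gin.Default", "creates an engine with logger and recovery middleware"),
        ApiDoc("(*gin.Context).JSON", "writes a value to the response as JSON"),
    ]
    print(build_prompt("Write a Go handler that returns a pong message as JSON.", knowledge))
```

In a fuller pipeline, the hand-written list would be replaced by whatever retrieves or predicts the relevant APIs for a task, which is the role the summary attributes to the external knowledge inquirer.
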