RTLCoder: Fully Open-Source and Efficient LLM-Assisted RTL Code Generation Technique

Shang Liu, Wenji Fang, Yao Lu, Jing Wang, Qijun Zhang, Hongce Zhang, Zhiyao Xie
2024-10-07
Abstract: The automatic generation of RTL code (e.g., Verilog) using natural language instructions and large language models (LLMs) has attracted significant research interest recently. However, most existing approaches rely heavily on commercial LLMs such as ChatGPT, while open-source LLMs tailored for this specific design generation task exhibit notably inferior performance. The absence of high-quality open-source solutions restricts the flexibility and data privacy of this emerging technique. In this study, we present a new customized LLM solution with a modest parameter count of only 7B, achieving better performance than GPT-3.5 on all representative benchmarks for RTL code generation. Notably, it outperforms GPT-4 on the VerilogEval Machine benchmark. This balance between accuracy and efficiency is made possible by our new RTL code dataset and a customized LLM algorithm, both of which have been made fully open-source. Furthermore, we have quantized our LLM to 4-bit with a total size of 4GB, enabling it to run on a single laptop with only slight performance degradation. This efficiency allows the RTL generator to serve as a local assistant for engineers, so that designs never have to leave the engineer's machine.
Programming Languages, Hardware Architecture
What problem does this paper attempt to address?
This paper addresses the poor performance of existing open-source large language models (LLMs) at automatically generating RTL code (such as Verilog) from natural language instructions. Commercial LLMs perform better, but they raise data privacy concerns and limit flexibility. The paper therefore proposes RTLCoder, a high-performance open-source LLM solution that aims to match or exceed commercial models in accuracy while remaining efficient and keeping user data private. The main contributions of the paper are:

1. **Dataset Generation**: An automated data generation process is proposed, producing a dataset of over 27,000 instruction-code samples. This addresses the difficulty of obtaining high-quality training data for IC design tasks.
2. **Model Training Scheme**: A new LLM training scheme based on code quality feedback is introduced, further improving the final model's performance. The resulting model surpasses GPT-3.5 on the representative RTL generation benchmarks and outperforms GPT-4 on the VerilogEval Machine benchmark.
3. **Lightweight Model Design**: The model has only 7 billion parameters and, after 4-bit quantization, has a total size of 4GB. This makes it suitable as a local assistant for engineers, with only slight performance degradation and without sending designs to an external service.
4. **Fully Open Source**: All components of RTLCoder, including the data generation process, the complete dataset, the model training algorithm, and the final fine-tuned model, are fully open-sourced, making it easier for researchers to replicate and improve upon the work.

Together, these contributions give RTLCoder accuracy competitive with commercial LLMs while providing a flexible, privacy-preserving solution for research and practical use.
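
As a rough sanity check on the 4-bit, 4GB figure quoted above, the storage needed for the weights alone can be estimated from the parameter count and the bits per weight. The sketch below is back-of-the-envelope arithmetic, not the authors' quantization procedure; the function name `weight_footprint_gb` and the choice of 1 GB = 10^9 bytes are assumptions made for illustration.

```python
def weight_footprint_gb(num_params: float, bits_per_weight: int) -> float:
    """Raw weight storage in gigabytes, assuming 1 GB = 1e9 bytes.

    Hypothetical helper for illustration: it counts only the weights and
    ignores quantization metadata, activations, and runtime buffers.
    """
    return num_params * bits_per_weight / 8 / 1e9


PARAMS_7B = 7e9  # parameter count stated in the abstract

fp16_gb = weight_footprint_gb(PARAMS_7B, 16)  # half-precision baseline
int4_gb = weight_footprint_gb(PARAMS_7B, 4)   # 4-bit quantized weights

print(f"16-bit weights: {fp16_gb:.1f} GB")
print(f" 4-bit weights: {int4_gb:.1f} GB")
```

Under these assumptions the 4-bit weights alone come to about 3.5 GB, compared with about 14 GB at 16-bit precision. The gap between 3.5 GB and the reported 4GB is plausibly taken up by per-group scaling factors and any layers kept at higher precision, though the source does not break this down.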