MG-Verilog: Multi-grained Dataset Towards Enhanced LLM-assisted Verilog Generation

Yongan Zhang, Zhongzhi Yu, Yonggan Fu, Cheng Wan, Yingyan Celine Lin
2024-07-03
Abstract: Large Language Models (LLMs) have recently shown promise in streamlining hardware design processes by encapsulating vast amounts of domain-specific data. In addition, they allow users to interact with the design processes through natural language instructions, thus making hardware design more accessible to developers. However, effectively leveraging LLMs in hardware design necessitates providing domain-specific data during inference (e.g., through in-context learning), fine-tuning, or pre-training. Unfortunately, existing publicly available hardware datasets are often limited in size, complexity, or detail, which hinders the effectiveness of LLMs in hardware design tasks. To address this issue, we first propose a set of criteria for creating high-quality hardware datasets that can effectively enhance LLM-assisted hardware design. Based on these criteria, we propose a Multi-Grained-Verilog (MG-Verilog) dataset, which encompasses descriptions at various levels of detail and corresponding code samples. To benefit the broader hardware design community, we have developed an open-source infrastructure that facilitates easy access, integration, and extension of the dataset to meet specific project needs. Furthermore, to fully exploit the potential of the MG-Verilog dataset, which varies in complexity and detail, we introduce a balanced fine-tuning scheme. This scheme serves as a unique use case to leverage the diverse levels of detail provided by the dataset. Extensive experiments demonstrate that the proposed dataset and fine-tuning scheme consistently improve the performance of LLMs in hardware design tasks.
Machine Learning, Artificial Intelligence, Hardware Architecture
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper addresses the poor performance of large language models (LLMs) on hardware design tasks. Although existing LLMs have demonstrated strong capabilities in other fields, they remain limited in generating practically usable hardware design code. Specifically, existing public hardware datasets are often limited in scale, complexity, and descriptive detail, which restricts how effectively LLMs can be applied to hardware design tasks.

To solve these problems, the authors propose a dataset named Multi-Grained-Verilog (MG-Verilog). The dataset pairs hardware descriptions at varying levels of detail with their corresponding Verilog code samples, with the aim of improving LLM performance on hardware design tasks. The authors also propose a balanced fine-tuning scheme that makes full use of the different levels of detail in the dataset.

### Main Contributions

1. **Proposing Criteria for High-Quality Hardware Datasets**: The authors propose a set of criteria for creating high-quality hardware datasets, which can guide the development of future datasets.
2. **Releasing the MG-Verilog Dataset**: This is an open-source, multi-grained Verilog dataset containing over 11,000 Verilog code samples and their corresponding natural language descriptions. The dataset provides descriptions at varying levels of detail, from high-level overviews to line-by-line annotations.
3. **Balanced Fine-Tuning Scheme**: The authors propose a balanced fine-tuning scheme that balances global and local code semantic knowledge by randomly selecting training samples with different levels of description.
4. **Experimental Validation**: Extensive experimental results show that LLMs fine-tuned with the MG-Verilog dataset outperform models trained on other datasets in terms of the accuracy and complexity of the generated Verilog code.
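
To make the balanced fine-tuning idea in contribution 3 concrete, the following is a minimal sketch rather than the authors' implementation. It assumes each code sample carries one description per granularity level and that one level is drawn uniformly at random each time a sample is visited; the summary only states that samples with different description levels are selected randomly. The names `VerilogSample`, `GRANULARITIES`, and `balanced_pairs`, as well as the three level labels, are hypothetical and do not come from the paper or its released code.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical granularity labels. The source only says that descriptions
# range from high-level overviews to line-by-line annotations.
GRANULARITIES = ("high_level", "block_level", "line_by_line")


@dataclass
class VerilogSample:
    code: str                     # Verilog source of one module
    descriptions: Dict[str, str]  # granularity label -> description text


def balanced_pairs(
    samples: List[VerilogSample], rng: random.Random
) -> List[Tuple[str, str, str]]:
    """Build one (level, description, code) training pair per sample.

    Assumption: the granularity level is drawn uniformly at random, so that
    over many passes the model sees both global (summary) and local
    (line-level) descriptions of the same code.
    """
    pairs = []
    for sample in samples:
        # Sort the keys so the draw does not depend on dict insertion order.
        level = rng.choice(sorted(sample.descriptions))
        pairs.append((level, sample.descriptions[level], sample.code))
    return pairs


# Toy data for illustration only; not taken from the MG-Verilog dataset.
toy = [
    VerilogSample(
        code="module and2(input a, input b, output y); assign y = a & b; endmodule",
        descriptions={lvl: f"{lvl} description of and2" for lvl in GRANULARITIES},
    )
]

if __name__ == "__main__":
    # Re-drawing each epoch lets the same code appear with different descriptions.
    for epoch in range(3):
        for level, text, code in balanced_pairs(toy, random.Random(epoch)):
            print(epoch, level, "->", text)
```

Re-drawing the level on every pass, rather than fixing one description per sample, is one simple way to keep the mix of coarse and fine descriptions roughly even without enlarging the training set.
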
### Background and Motivation

The main challenges faced by existing LLMs in hardware design tasks include:

- **Generating Non-Synthesizable or Non-Functional Hardware Source Code**: The code generated by existing LLMs may not be directly usable for actual hardware design.
- **Lack of Sufficient Domain-Specific Data**: Existing hardware datasets are insufficient in scale, complexity, and detailed descriptions, limiting the effective application of LLMs.

To overcome these challenges, the authors propose the MG-Verilog dataset and a balanced fine-tuning scheme to improve the performance of LLMs in hardware design tasks.