Revisiting VerilogEval: Newer LLMs, In-Context Learning, and Specification-to-RTL Tasks

Nathaniel Pinckney, Christopher Batten, Mingjie Liu, Haoxing Ren, Brucek Khailany
2024-08-21
Abstract: The application of large-language models (LLMs) to digital hardware code generation is an emerging field. Most LLMs are primarily trained on natural language and software code. Hardware code, such as Verilog, represents only a small portion of the training data and few hardware benchmarks exist. To address this gap, the open-source VerilogEval benchmark was released in 2023, providing a consistent evaluation framework for LLMs on code completion tasks. It was tested on state-of-the-art models at the time including GPT-4. However, VerilogEval and other Verilog generation benchmarks lack failure analysis and, in present form, are not conducive to exploring prompting techniques. Also, since VerilogEval's release, both commercial and open-source models have seen continued development. In this work, we evaluate new commercial and open-source models of varying sizes against an improved VerilogEval benchmark suite. We enhance VerilogEval's infrastructure and dataset by automatically classifying failures, introduce new prompts for supporting in-context learning (ICL) examples, and extend the supported tasks to specification-to-RTL translation. We find a measurable improvement in commercial state-of-the-art models, with GPT-4 Turbo achieving a 59% pass rate on spec-to-RTL tasks. We also study the performance of open-source and domain-specific models that have emerged, and demonstrate that models can benefit substantially from ICL. We find that recently-released Llama 3.1 405B achieves a pass rate of 58%, effectively matching that of GPT-4 Turbo, and that the much smaller domain-specific RTL-Coder 6.7B models achieve an impressive 37% pass rate. However, prompt engineering is key to achieving good pass rates, and varies widely with model and task. A benchmark infrastructure that allows for prompt engineering and failure analysis is key to continued model development and deployment.
Software Engineering, Artificial Intelligence
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

The paper addresses several key issues in digital hardware code generation with large language models (LLMs):

1. **Limitations of Existing Benchmarks**:
   - Current benchmarks (e.g., VerilogEval) mainly focus on code completion tasks and lack support for specification-to-RTL (register-transfer level) tasks.
   - Existing benchmarks do not provide detailed failure analysis, making it difficult to understand why a model fails on a specific task.
   - They lack support for in-context learning (ICL) examples, which limits the exploration of prompting techniques.
2. **Improvement of Model Performance**:
   - Evaluate the latest commercial and open-source models on code generation tasks, particularly spec-to-RTL tasks.
   - Investigate how in-context learning examples affect performance, and how that effect differs across models and tasks.
3. **Improvement of Benchmark Infrastructure**:
   - Add support for spec-to-RTL tasks to make the benchmark more comprehensive.
   - Introduce automatic failure classification to better understand the reasons for model failures.
   - Provide mechanisms to support in-context learning examples so that different prompting techniques can be explored.
   - Improve the evaluation environment to make it more extensible and easier to inspect manually.

### Main Contributions

1. **Extending the VerilogEval Benchmark**:
   - Added support for spec-to-RTL tasks to align with the instruction tuning of current models.
   - Introduced in-context learning examples to improve model performance on specific tasks.
   - Provided automatic failure classification to gain a more detailed understanding of model failures.
2. **Evaluating the Latest Models**:
   - Evaluated several newly released commercial and open-source models, including GPT-4 Turbo, Llama 3.1 8B/70B/405B, Mistral Large, Deepseek Coder 33B and 6.7B, CodeGemma 7B, and RTL-Coder.
   - Found that GPT-4 Turbo achieved a 59% pass rate on spec-to-RTL tasks, while Llama 3.1 405B reached a 58% pass rate.
3. **Impact of In-Context Learning**:
   - Studied how performance changes as the number of in-context learning examples increases, finding that some models (e.g., GPT-4 Turbo) showed stable or improved performance with more examples, while other models (e.g., Llama 3 70B) exhibited different trends.
4. **Infrastructure Improvements**:
   - Improved the evaluation environment using Makefile and text files, making it more extensible and easier to inspect manually.
   - Provided a publicly available improved version of the VerilogEval benchmark for further research and development by the community.

### Conclusion

By extending and improving the VerilogEval benchmark, the paper provides a more robust framework for evaluating large language models on digital hardware code generation tasks. The results indicate that the latest models perform well on code completion and spec-to-RTL tasks (GPT-4 Turbo reaches a 59% pass rate on spec-to-RTL tasks and Llama 3.1 405B reaches 58%), while open-source and domain-specific models are approaching or even surpassing last year's closed models. Additionally, in-context learning examples significantly improved the performance of certain models, but the impact varies with the model and task, highlighting the importance of task-specific prompt tuning. The improved benchmark infrastructure, especially the new failure classification feature, provides valuable insight into the types of errors made by different models.
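To make the three infrastructure ideas above more concrete (in-context examples in the prompt, automatic failure classification, and pass-rate reporting), here is a minimal Python sketch. It is not the paper's implementation: the prompt layout, the category names, the log patterns, and the helper names `build_prompt`, `classify_failure`, and `summarize` are all assumptions made for illustration.

```python
import re
from collections import Counter
from typing import Dict, List, Optional, Tuple

# Hypothetical failure categories and log patterns, checked in order.
# The benchmark's real categories and matching rules may differ.
FAILURE_PATTERNS = [
    ("syntax_error", re.compile(r"syntax error", re.IGNORECASE)),
    ("undeclared_identifier", re.compile(r"unable to bind|not declared", re.IGNORECASE)),
    ("timeout", re.compile(r"timeout", re.IGNORECASE)),
    ("functional_mismatch", re.compile(r"mismatch", re.IGNORECASE)),
]


def build_prompt(spec: str, examples: List[Tuple[str, str]], shots: int) -> str:
    """Prepend up to `shots` (specification, solution) pairs to the target spec.

    The "Question/Answer" layout is an assumed format, not the paper's prompt.
    """
    parts = []
    for ex_spec, ex_solution in examples[:shots]:
        parts.append(f"Question:\n{ex_spec}\n\nAnswer:\n{ex_solution}\n")
    parts.append(f"Question:\n{spec}\n\nAnswer:\n")
    return "\n".join(parts)


def classify_failure(log: str) -> str:
    """Return the first category whose pattern appears in the tool log."""
    for label, pattern in FAILURE_PATTERNS:
        if pattern.search(log):
            return label
    return "unclassified"


def summarize(results: Dict[str, Optional[str]]) -> Tuple[float, Counter]:
    """Compute the pass rate and a per-category failure count.

    `results` maps a problem name to its failure log, or to None if it passed.
    """
    if not results:
        return 0.0, Counter()
    passed = sum(1 for log in results.values() if log is None)
    failures = Counter(
        classify_failure(log) for log in results.values() if log is not None
    )
    return passed / len(results), failures
```

Calling `summarize` on a dictionary of per-problem results yields one pass-rate figure plus a breakdown of why the remaining problems failed, which is the kind of view that makes prompt changes (for example, varying `shots`) easier to compare across models.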