Evaluating LLMs for Hardware Design and Test

Jason Blocklove, Siddharth Garg, Ramesh Karri, Hammond Pearce
2024-04-24
Abstract: Large Language Models (LLMs) have demonstrated capabilities for producing code in Hardware Description Languages (HDLs). However, most of the focus remains on their abilities to write functional code, not test code. The hardware design process consists of both design and test, so eschewing validation and verification leaves considerable potential benefit unexplored, given that a design and test framework may allow for progress towards full automation of the digital design pipeline. In this work, we perform one of the first studies exploring how an LLM can both design and test hardware modules from provided specifications. Using a suite of 8 representative benchmarks, we examined the capabilities and limitations of state-of-the-art conversational LLMs when producing Verilog for functional and verification purposes. We taped out the benchmarks on a Skywater 130nm shuttle and received the functional chip.
Hardware Architecture, Artificial Intelligence, Computation and Language, Machine Learning, Programming Languages
What problem does this paper attempt to address?
This paper evaluates the capabilities of large language models (LLMs) in hardware design and testing. Specifically, the researchers study how LLMs can generate functional Hardware Description Language (HDL) code, such as Verilog, together with the corresponding testbenches, from given specifications. This covers both the design side (generating hardware module code that implements a specified function) and the test side (creating testbenches that verify the correctness of those modules). Through this research, the authors explore the potential of LLMs in an automated digital design flow, especially in design verification and testing.

The paper uses 8 representative benchmarks to evaluate four conversational LLMs (ChatGPT-4, ChatGPT-3.5, Bard, HuggingChat) on generating functional HDL code and testbenches. The research focuses on:

1. **Design Capability**: Evaluate the ability of LLMs to generate correct HDL code from the given specifications.
2. **Testing Capability**: Evaluate the ability of LLMs to generate effective, self-checking testbenches that can be used to verify the functionality of the generated HDL code.
3. **Interactivity**: Examine the effectiveness of tool feedback (TF), simple human feedback (SHF), medium human feedback (MHF), and advanced human feedback (AHF) in fixing errors when the code or testbenches generated by LLMs are incorrect.

Through these evaluations, the researchers aim to reveal the practical potential and limitations of current LLMs in hardware design and testing, and to provide directions for further research and development.
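The four feedback types named above suggest an escalation ladder, from feedback that needs no human judgment (TF) to feedback that needs the most (AHF). The sketch below is one way such a ladder could be driven, not the paper's actual protocol: it assumes each level is tried once, in that order, and the names `escalate_feedback`, `attempt`, and `passes` are hypothetical stand-ins for "ask the LLM again with this kind of feedback" and "simulate the design against its testbench".

```python
from typing import Callable, List, Optional, Tuple

# Feedback levels named in the summary, ordered from least to most human effort.
FEEDBACK_LEVELS = ["TF", "SHF", "MHF", "AHF"]


def escalate_feedback(
    attempt: Callable[[Optional[str]], str],
    passes: Callable[[str], bool],
) -> Tuple[bool, List[str]]:
    """Try an initial generation, then escalate through the feedback levels.

    attempt(level) is a hypothetical callback returning candidate HDL text;
    level is None for the first, feedback-free generation.
    passes(candidate) is a hypothetical callback that reports whether the
    candidate compiles and passes its testbench.
    Returns (success, list of feedback levels that were used).
    """
    used: List[str] = []
    candidate = attempt(None)  # initial generation from the specification alone
    if passes(candidate):
        return True, used
    # Assumption: one retry per level, escalating only on failure.
    for level in FEEDBACK_LEVELS:
        used.append(level)
        candidate = attempt(level)
        if passes(candidate):
            return True, used
    return False, used
```

In a real flow, `passes` would wrap a compile-and-simulate step, and `attempt` would send the tool output or a human-written hint back into the conversation. The returned list of levels gives a rough measure of how much human effort a given benchmark required.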