Benchmarking Large Language Models for Automated Verilog RTL Code Generation. Thakur, S., Ahmad, B., Fan, Z., Pearce, H., Tan, B., Karri, R., Dolan-Gavitt, B., & Garg, S. December 2022. arXiv:2212.11140 [cs]
Automating hardware design could obviate a significant amount of human error from the engineering process and lead to fewer errors. Verilog is a popular hardware description language to model and design digital systems, thus generating Verilog code is a critical first step. Emerging large language models (LLMs) are able to write high-quality code in other programming languages. In this paper, we characterize the ability of LLMs to generate useful Verilog. For this, we fine-tune pre-trained LLMs on Verilog datasets collected from GitHub and Verilog textbooks. We construct an evaluation framework comprising test-benches for functional analysis and a flow to test the syntax of Verilog code generated in response to problems of varying difficulty. Our findings show that across our problem scenarios, the fine-tuning results in LLMs more capable of producing syntactically correct code (25.9% overall). Further, when analyzing functional correctness, a fine-tuned open-source CodeGen LLM can outperform the state-of-the-art commercial Codex LLM (6.5% overall). Training/evaluation scripts and LLM checkpoints are available: https://github.com/shailja-thakur/VGen.
@misc{thakur_benchmarking_2022,
	title = {Benchmarking {Large} {Language} {Models} for {Automated} {Verilog} {RTL} {Code} {Generation}},
	url = {http://arxiv.org/abs/2212.11140},
	doi = {10.48550/arXiv.2212.11140},
	abstract = {Automating hardware design could obviate a significant amount of human error from the engineering process and lead to fewer errors. Verilog is a popular hardware description language to model and design digital systems, thus generating Verilog code is a critical first step. Emerging large language models (LLMs) are able to write high-quality code in other programming languages. In this paper, we characterize the ability of LLMs to generate useful Verilog. For this, we fine-tune pre-trained LLMs on Verilog datasets collected from GitHub and Verilog textbooks. We construct an evaluation framework comprising test-benches for functional analysis and a flow to test the syntax of Verilog code generated in response to problems of varying difficulty. Our findings show that across our problem scenarios, the fine-tuning results in LLMs more capable of producing syntactically correct code (25.9\% overall). Further, when analyzing functional correctness, a fine-tuned open-source CodeGen LLM can outperform the state-of-the-art commercial Codex LLM (6.5\% overall). Training/evaluation scripts and LLM checkpoints are available: https://github.com/shailja-thakur/VGen.},
	urldate = {2023-02-24},
	publisher = {arXiv},
	author = {Thakur, Shailja and Ahmad, Baleegh and Fan, Zhenxing and Pearce, Hammond and Tan, Benjamin and Karri, Ramesh and Dolan-Gavitt, Brendan and Garg, Siddharth},
	month = dec,
	year = {2022},
	note = {arXiv:2212.11140 [cs]},
	keywords = {\#broken, Computer Science - Machine Learning, Computer Science - Programming Languages, Computer Science - Software Engineering, Jab/\#Pre},
}
