Benchmarking Large Language Models for Automated Verilog RTL Code Generation. Thakur, S., Ahmad, B., Fan, Z., Pearce, H., Tan, B., Karri, R., Dolan-Gavitt, B., & Garg, S. December 2022. arXiv:2212.11140 [cs]
Automating hardware design could obviate a significant amount of human error from the engineering process and lead to fewer errors. Verilog is a popular hardware description language to model and design digital systems, thus generating Verilog code is a critical first step. Emerging large language models (LLMs) are able to write high-quality code in other programming languages. In this paper, we characterize the ability of LLMs to generate useful Verilog. For this, we fine-tune pre-trained LLMs on Verilog datasets collected from GitHub and Verilog textbooks. We construct an evaluation framework comprising test-benches for functional analysis and a flow to test the syntax of Verilog code generated in response to problems of varying difficulty. Our findings show that across our problem scenarios, the fine-tuning results in LLMs more capable of producing syntactically correct code (25.9% overall). Further, when analyzing functional correctness, a fine-tuned open-source CodeGen LLM can outperform the state-of-the-art commercial Codex LLM (6.5% overall). Training/evaluation scripts and LLM checkpoints are available: https://github.com/shailja-thakur/VGen.
@misc{thakur_benchmarking_2022,
title = {Benchmarking {Large} {Language} {Models} for {Automated} {Verilog} {RTL} {Code} {Generation}},
url = {http://arxiv.org/abs/2212.11140},
doi = {10.48550/arXiv.2212.11140},
abstract = {Automating hardware design could obviate a significant amount of human error from the engineering process and lead to fewer errors. Verilog is a popular hardware description language to model and design digital systems, thus generating Verilog code is a critical first step. Emerging large language models (LLMs) are able to write high-quality code in other programming languages. In this paper, we characterize the ability of LLMs to generate useful Verilog. For this, we fine-tune pre-trained LLMs on Verilog datasets collected from GitHub and Verilog textbooks. We construct an evaluation framework comprising test-benches for functional analysis and a flow to test the syntax of Verilog code generated in response to problems of varying difficulty. Our findings show that across our problem scenarios, the fine-tuning results in LLMs more capable of producing syntactically correct code (25.9\% overall). Further, when analyzing functional correctness, a fine-tuned open-source CodeGen LLM can outperform the state-of-the-art commercial Codex LLM (6.5\% overall). Training/evaluation scripts and LLM checkpoints are available: https://github.com/shailja-thakur/VGen.},
urldate = {2023-02-24},
publisher = {arXiv},
author = {Thakur, Shailja and Ahmad, Baleegh and Fan, Zhenxing and Pearce, Hammond and Tan, Benjamin and Karri, Ramesh and Dolan-Gavitt, Brendan and Garg, Siddharth},
month = dec,
year = {2022},
note = {arXiv:2212.11140 [cs]},
keywords = {\#broken, Computer Science - Machine Learning, Computer Science - Programming Languages, Computer Science - Software Engineering, Jab/\#Pre},
}