Welcome to the big leaves: Best practices for improving genome annotation in non-model plant genomes. Vuruputoor, V. S., Monyak, D., Fetter, K. C., Webster, C., Bhattarai, A., Shrestha, B., Zaman, S., Bennett, J., McEvoy, S. L., Caballero, M., & Wegrzyn, J. L. Applications in Plant Sciences, 11(4):e11533, 2023. _eprint: https://bsapubs.onlinelibrary.wiley.com/doi/pdf/10.1002/aps3.11533
Premise: Robust standards to evaluate quality and completeness are lacking in eukaryotic structural genome annotation, as genome annotation software is developed using model organisms and typically lacks benchmarking to comprehensively evaluate the quality and accuracy of the final predictions. The annotation of plant genomes is particularly challenging due to their large sizes, abundant transposable elements, and variable ploidies. This study investigates the impact of genome quality, complexity, sequence read input, and method on protein-coding gene predictions.
Methods: The impact of repeat masking, long-read and short-read inputs, and de novo and genome-guided protein evidence was examined in the context of the popular BRAKER and MAKER workflows for five plant genomes. The annotations were benchmarked for structural traits and sequence similarity.
Results: Benchmarks that reflect gene structures, reciprocal similarity search alignments, and mono-exonic/multi-exonic gene counts provide a more complete view of annotation accuracy. Transcripts derived from RNA-read alignments alone are not sufficient for genome annotation. Gene prediction workflows that combine evidence-based and ab initio approaches are recommended, and a combination of short and long reads can improve genome annotation. Adding protein evidence from de novo assemblies, genome-guided transcriptome assemblies, or full-length proteins from OrthoDB generates more putative false positives as implemented in the current workflows. Post-processing with functional and structural filters is highly recommended.
Discussion: While the annotation of non-model plant genomes remains complex, this study provides recommendations for inputs and methodological approaches. We discuss a set of best practices to generate an optimal plant genome annotation and present a more robust set of metrics to evaluate the resulting predictions.
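The abstract names mono-exonic/multi-exonic gene counts as one of the benchmarks. The sketch below shows one way such a count might be taken from a GFF3 annotation; it is an illustration only, not the authors' pipeline. It assumes that exons appear as `exon` features whose `Parent` attribute names a transcript, and that counting per transcript is an acceptable stand-in for counting per gene. The function names and the example records are hypothetical.

```python
from collections import Counter


def exon_counts(gff3_lines):
    """Count exon features per parent transcript ID in GFF3 lines."""
    counts = Counter()
    for line in gff3_lines:
        # Skip blank lines, directives, and comments.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        # GFF3 records have nine tab-separated columns; column 3 is the type.
        if len(cols) != 9 or cols[2] != "exon":
            continue
        # Column 9 holds semicolon-separated key=value attributes.
        attrs = dict(
            field.strip().split("=", 1)
            for field in cols[8].split(";")
            if "=" in field
        )
        # An exon may list several parents, separated by commas.
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                counts[parent] += 1
    return counts


def mono_multi_summary(gff3_lines):
    """Return (mono-exonic count, multi-exonic count, mono:multi ratio)."""
    counts = exon_counts(gff3_lines)
    mono = sum(1 for n in counts.values() if n == 1)
    multi = sum(1 for n in counts.values() if n > 1)
    # The ratio is undefined when there are no multi-exonic transcripts.
    ratio = mono / multi if multi else None
    return mono, multi, ratio


# Hypothetical records: one two-exon transcript and one single-exon transcript.
EXAMPLE = [
    "##gff-version 3",
    "chr1\tsketch\tmRNA\t1\t900\t.\t+\t.\tID=t1;Parent=g1",
    "chr1\tsketch\texon\t1\t300\t.\t+\t.\tID=e1;Parent=t1",
    "chr1\tsketch\texon\t600\t900\t.\t+\t.\tID=e2;Parent=t1",
    "chr1\tsketch\tmRNA\t2000\t2500\t.\t-\t.\tID=t2;Parent=g2",
    "chr1\tsketch\texon\t2000\t2500\t.\t-\t.\tID=e3;Parent=t2",
]

if __name__ == "__main__":
    print(mono_multi_summary(EXAMPLE))
```

For a real annotation, the lines of an open GFF3 file handle could be passed in place of `EXAMPLE`. Annotations that record only `CDS` features would need the feature type in the sketch adjusted accordingly.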
@article{vuruputoor_welcome_2023,
title = {Welcome to the big leaves: {Best} practices for improving genome annotation in non-model plant genomes},
volume = {11},
copyright = {© 2023 The Authors. Applications in Plant Sciences published by Wiley Periodicals LLC on behalf of Botanical Society of America.},
issn = {2168-0450},
shorttitle = {Welcome to the big leaves},
url = {https://onlinelibrary.wiley.com/doi/abs/10.1002/aps3.11533},
doi = {10.1002/aps3.11533},
abstract = {Premise Robust standards to evaluate quality and completeness are lacking in eukaryotic structural genome annotation, as genome annotation software is developed using model organisms and typically lacks benchmarking to comprehensively evaluate the quality and accuracy of the final predictions. The annotation of plant genomes is particularly challenging due to their large sizes, abundant transposable elements, and variable ploidies. This study investigates the impact of genome quality, complexity, sequence read input, and method on protein-coding gene predictions. Methods The impact of repeat masking, long-read and short-read inputs, and de novo and genome-guided protein evidence was examined in the context of the popular BRAKER and MAKER workflows for five plant genomes. The annotations were benchmarked for structural traits and sequence similarity. Results Benchmarks that reflect gene structures, reciprocal similarity search alignments, and mono-exonic/multi-exonic gene counts provide a more complete view of annotation accuracy. Transcripts derived from RNA-read alignments alone are not sufficient for genome annotation. Gene prediction workflows that combine evidence-based and ab initio approaches are recommended, and a combination of short and long reads can improve genome annotation. Adding protein evidence from de novo assemblies, genome-guided transcriptome assemblies, or full-length proteins from OrthoDB generates more putative false positives as implemented in the current workflows. Post-processing with functional and structural filters is highly recommended. Discussion While the annotation of non-model plant genomes remains complex, this study provides recommendations for inputs and methodological approaches. We discuss a set of best practices to generate an optimal plant genome annotation and present a more robust set of metrics to evaluate the resulting predictions.},
language = {en},
number = {4},
urldate = {2026-01-16},
journal = {Applications in Plant Sciences},
author = {Vuruputoor, Vidya S. and Monyak, Daniel and Fetter, Karl C. and Webster, Cynthia and Bhattarai, Akriti and Shrestha, Bikash and Zaman, Sumaira and Bennett, Jeremy and McEvoy, Susan L. and Caballero, Madison and Wegrzyn, Jill L.},
year = {2023},
note = {\_eprint: https://bsapubs.onlinelibrary.wiley.com/doi/pdf/10.1002/aps3.11533},
keywords = {BRAKER, MAKER, StringTie2, TSEBRA, gene identification, genome annotation, plant genomes},
pages = {e11533},
}