Single-sequence protein structure prediction using supervised transformer protein language models

Single-sequence protein structure prediction using supervised transformer protein language models. Wang, W., Peng, Z., & Yang, J. Technical Report, Cold Spring Harbor Laboratory, January 2022. Section: New Results.


It remains challenging for single-sequence protein structure prediction with AlphaFold2 and other deep learning methods. In this work, we introduce trRosettaX-Single, a novel algorithm for single-sequence protein structure prediction. It is built on sequence embedding from s-ESM-1b, a supervised transformer protein language model optimized from the pre-trained model ESM-1b. The sequence embedding is fed into a multi-scale network with knowledge distillation to predict inter-residue 2D geometry, including distance and orientations. The predicted 2D geometry is then used to reconstruct 3D structure models based on energy minimization. Benchmark tests show that trRosettaX-Single outperforms AlphaFold2 and RoseTTAFold on natural proteins. For instance, with single-sequence input, trRosettaX-Single generates structure models with an average TM-score ~0.5 on 77 CASP14 domains, significantly higher than AlphaFold2 (0.35) and RoseTTAFold (0.34). Further test on 101 human-designed proteins indicates that trRosettaX-Single works very well, with accuracy (average TM-score 0.77) approaching AlphaFold2 and higher than RoseTTAFold, but using much less computing resource. On 2000 designed proteins from network hallucination, trRosettaX-Single generates structure models highly consistent to the hallucinated ones. These data suggest that trRosettaX-Single may find immediate applications in de novo protein design and related studies. trRosettaX-Single is available through the trRosetta server at: http://yanglab.nankai.edu.cn/trRosetta/.

@techreport{wang_single-sequence_2022,
	title = {Single-sequence protein structure prediction using supervised transformer protein language models},
	copyright = {© 2022, Posted by Cold Spring Harbor Laboratory. The copyright holder for this pre-print is the author. All rights reserved. The material may not be redistributed, re-used or adapted without the author's permission.},
	url = {https://www.biorxiv.org/content/10.1101/2022.01.15.476476v1},
	abstract = {It remains challenging for single-sequence protein structure prediction with AlphaFold2 and other deep learning methods. In this work, we introduce trRosettaX-Single, a novel algorithm for single-sequence protein structure prediction. It is built on sequence embedding from s-ESM-1b, a supervised transformer protein language model optimized from the pre-trained model ESM-1b. The sequence embedding is fed into a multi-scale network with knowledge distillation to predict inter-residue 2D geometry, including distance and orientations. The predicted 2D geometry is then used to reconstruct 3D structure models based on energy minimization. Benchmark tests show that trRosettaX-Single outperforms AlphaFold2 and RoseTTAFold on natural proteins. For instance, with single-sequence input, trRosettaX-Single generates structure models with an average TM-score {\textasciitilde}0.5 on 77 CASP14 domains, significantly higher than AlphaFold2 (0.35) and RoseTTAFold (0.34). Further test on 101 human-designed proteins indicates that trRosettaX-Single works very well, with accuracy (average TM-score 0.77) approaching AlphaFold2 and higher than RoseTTAFold, but using much less computing resource. On 2000 designed proteins from network hallucination, trRosettaX-Single generates structure models highly consistent to the hallucinated ones. These data suggest that trRosettaX-Single may find immediate applications in de novo protein design and related studies. trRosettaX-Single is available through the trRosetta server at: http://yanglab.nankai.edu.cn/trRosetta/.},
	language = {en},
	urldate = {2022-01-25},
	author = {Wang, Wenkai and Peng, Zhenling and Yang, Jianyi},
	month = jan,
	year = {2022},
	doi = {10.1101/2022.01.15.476476},
	note = {Company: Cold Spring Harbor Laboratory
Distributor: Cold Spring Harbor Laboratory
Label: Cold Spring Harbor Laboratory
Section: New Results
Type: article},
	pages = {2022.01.15.476476},
}
