Toward More General Embeddings for Protein Design: Harnessing Joint Representations of Sequence and Structure. Mansoor, S., Baek, M., Madan, U., & Horvitz, E. bioRxiv preprint, Cold Spring Harbor Laboratory, September 2021.
Protein embeddings learned from aligned sequences have been leveraged in a wide array of tasks in protein understanding and engineering. The sequence embeddings are generated through semi-supervised training on millions of sequences with deep neural models defined with hundreds of millions of parameters, and they continue to increase in performance on target tasks with increasing complexity. We report a more data-efficient approach to encode protein information through joint training on protein sequence and structure in a semi-supervised manner. We show that the method is able to encode both types of information to form a rich embedding space which can be used for downstream prediction tasks. We show that the incorporation of rich structural information into the context under consideration boosts the performance of the model in predicting the effects of single mutations. We attribute increases in accuracy to the value of leveraging proximity within the enriched representation to identify sequentially and spatially close residues that would be affected by the mutation, using experimentally validated or predicted structures.
@techreport{mansoor_toward_2021,
	title = {Toward {More} {General} {Embeddings} for {Protein} {Design}: {Harnessing} {Joint} {Representations} of {Sequence} and {Structure}},
	copyright = {© 2021, Posted by Cold Spring Harbor Laboratory. This pre-print is available under a Creative Commons License (Attribution-NonCommercial-NoDerivs 4.0 International), CC BY-NC-ND 4.0, as described at http://creativecommons.org/licenses/by-nc-nd/4.0/},
	shorttitle = {Toward {More} {General} {Embeddings} for {Protein} {Design}},
	url = {https://www.biorxiv.org/content/10.1101/2021.09.01.458592v1},
	abstract = {Protein embeddings learned from aligned sequences have been leveraged in a wide array of tasks in protein understanding and engineering. The sequence embeddings are generated through semi-supervised training on millions of sequences with deep neural models defined with hundreds of millions of parameters, and they continue to increase in performance on target tasks with increasing complexity. We report a more data-efficient approach to encode protein information through joint training on protein sequence and structure in a semi-supervised manner. We show that the method is able to encode both types of information to form a rich embedding space which can be used for downstream prediction tasks. We show that the incorporation of rich structural information into the context under consideration boosts the performance of the model in predicting the effects of single mutations. We attribute increases in accuracy to the value of leveraging proximity within the enriched representation to identify sequentially and spatially close residues that would be affected by the mutation, using experimentally validated or predicted structures.},
	language = {en},
	urldate = {2022-01-17},
	author = {Mansoor, Sanaa and Baek, Minkyung and Madan, Umesh and Horvitz, Eric},
	month = sep,
	year = {2021},
	doi = {10.1101/2021.09.01.458592},
	institution = {Cold Spring Harbor Laboratory},
}
