Toward More General Embeddings for Protein Design: Harnessing Joint Representations of Sequence and Structure. Mansoor, S., Baek, M., Madan, U., & Horvitz, E. bioRxiv preprint, Cold Spring Harbor Laboratory, September 2021.
Protein embeddings learned from aligned sequences have been leveraged in a wide array of tasks in protein understanding and engineering. The sequence embeddings are generated through semi-supervised training on millions of sequences with deep neural models defined with hundreds of millions of parameters, and they continue to increase in performance on target tasks with increasing complexity. We report a more data-efficient approach to encode protein information through joint training on protein sequence and structure in a semi-supervised manner. We show that the method is able to encode both types of information to form a rich embedding space which can be used for downstream prediction tasks. We show that the incorporation of rich structural information into the context under consideration boosts the performance of the model in predicting the effects of single mutations. We attribute increases in accuracy to the value of leveraging proximity within the enriched representation to identify sequentially and spatially close residues that would be affected by the mutation, using experimentally validated or predicted structures.
@techreport{mansoor_toward_2021,
	title = {Toward {More} {General} {Embeddings} for {Protein} {Design}: {Harnessing} {Joint} {Representations} of {Sequence} and {Structure}},
	copyright = {© 2021, Posted by Cold Spring Harbor Laboratory. This pre-print is available under a Creative Commons License (Attribution-NonCommercial-NoDerivs 4.0 International), CC BY-NC-ND 4.0, as described at http://creativecommons.org/licenses/by-nc-nd/4.0/},
	shorttitle = {Toward {More} {General} {Embeddings} for {Protein} {Design}},
	url = {https://www.biorxiv.org/content/10.1101/2021.09.01.458592v1},
	abstract = {Protein embeddings learned from aligned sequences have been leveraged in a wide array of tasks in protein understanding and engineering. The sequence embeddings are generated through semi-supervised training on millions of sequences with deep neural models defined with hundreds of millions of parameters, and they continue to increase in performance on target tasks with increasing complexity. We report a more data-efficient approach to encode protein information through joint training on protein sequence and structure in a semi-supervised manner. We show that the method is able to encode both types of information to form a rich embedding space which can be used for downstream prediction tasks. We show that the incorporation of rich structural information into the context under consideration boosts the performance of the model in predicting the effects of single mutations. We attribute increases in accuracy to the value of leveraging proximity within the enriched representation to identify sequentially and spatially close residues that would be affected by the mutation, using experimentally validated or predicted structures.},
	language = {en},
	urldate = {2022-01-17},
	author = {Mansoor, Sanaa and Baek, Minkyung and Madan, Umesh and Horvitz, Eric},
	month = sep,
	year = {2021},
	doi = {10.1101/2021.09.01.458592},
	institution = {Cold Spring Harbor Laboratory},
}
