Beyond the E-Value: Stratified Statistics for Protein Domain Prediction. Ochoa, A., Storey, J. D., Llinás, M., & Singh, M. PLoS Computational Biology, 11(11):e1004509, November 2015.
Author Summary: Despite decades of research, it remains a challenge to distinguish homologous relationships between proteins from sequence similarities arising due to chance alone. This is an increasingly important problem as sequence database sizes continue to grow, and even today many computational analyses require that the statistics of billions of sequence comparisons be assessed automatically. Here we explore statistical significance evaluation on data that is stratified (that is, naturally partitioned into subsets that may differ in their amount of signal) and find a theoretically optimal criterion for automatically setting thresholds of significance for each stratum. For the task of domain prediction, an important component of efforts to annotate protein sequences and identify remote sequence homologs, we empirically show that our stratified analysis of statistical significance greatly improves upon a combined analysis. Further, we identify weaknesses in the prevailing random sequence model for assessing statistical significance for a small subset of domain families with repetitive sequence patterns and known biological, structural, and evolutionary properties. Our theoretical findings in statistics are relevant not only for identifying protein domains, but for arbitrary stratified problems in genomics and beyond.
@article{ochoa_beyond_2015,
title = {Beyond the {E}-{Value}: {Stratified} {Statistics} for {Protein} {Domain} {Prediction}},
volume = {11},
url = {http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004509},
doi = {10.1371/journal.pcbi.1004509},
abstract = {Author Summary Despite decades of research, it remains a challenge to distinguish homologous relationships between proteins from sequence similarities arising due to chance alone. This is an increasingly important problem as sequence database sizes continue to grow, and even today many computational analyses require that the statistics of billions of sequence comparisons be assessed automatically. Here we explore statistical significance evaluation on data that is stratified—that is, naturally partitioned into subsets that may differ in their amount of signal—and find a theoretically optimal criterion for automatically setting thresholds of significance for each stratum. For the task of domain prediction, an important component of efforts to annotate protein sequences and identify remote sequence homologs, we empirically show that our stratified analysis of statistical significance greatly improves upon a combined analysis. Further, we identify weaknesses in the prevailing random sequence model for assessing statistical significance for a small subset of domain families with repetitive sequence patterns and known biological, structural, and evolutionary properties. Our theoretical findings in statistics are relevant not only for identifying protein domains, but for arbitrary stratified problems in genomics and beyond.},
language = {English},
number = {11},
journal = {PLoS Computational Biology},
author = {Ochoa, Alejandro and Storey, John D. and Llinás, Manuel and Singh, Mona},
month = nov,
year = {2015},
pages = {e1004509},
}