Similarity to Molecules in the Training Set Is a Good Discriminator for Prediction Accuracy in QSAR

Similarity to Molecules in the Training Set Is a Good Discriminator for Prediction Accuracy in QSAR. Sheridan, R., Feuston, B., Maiorov, V., & Kearsley, S. J. Chem. Inf. Comput. Sci., 44(6):1912–1928, 2004.


How well can a QSAR model predict the activity of a molecule not in the training set used to create the model? A set of retrospective cross-validation experiments using 20 diverse in-house activity sets were done to find a good discriminator of prediction accuracy as measured by root-mean-square difference between observed and predicted activity. Among the measures we tested, two seem useful: the similarity of the molecule to be predicted to the nearest molecule in the training set and/or the number of neighbors in the training set, where neighbors are those more similar than a user-chosen cutoff. The molecules with the highest similarity and/or the most neighbors are the best-predicted. This trend holds true for narrow training sets and, to a lesser degree, for many diverse training sets and does not depend on which QSAR method or descriptor is used. One may define the similarity using a different descriptor than that used for the QSAR model. The similarity dependence for diverse training sets is somewhat unexpected. It appears to be greater for those data sets where the association of similar activities vs similar structures (as encoded in the Patterson plot) is stronger. We propose a way to estimate the reliability of the prediction of an arbitrary chemical structure on a given QSAR model, given the training set from which the model was derived.
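The two measures named in the abstract (similarity to the nearest training-set molecule, and the number of training-set neighbors above a user-chosen cutoff) can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not name a similarity metric or descriptor, so Tanimoto similarity over fingerprints stored as sets of integer feature IDs is assumed here, and the names `tanimoto` and `nearest_and_neighbors` as well as the default cutoff of 0.7 are hypothetical.

```python
from typing import FrozenSet, Sequence, Tuple

# Assumption: a fingerprint is a set of integer feature IDs.
Fingerprint = FrozenSet[int]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto similarity |a & b| / |a | b| (assumed metric, not from the paper)."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def nearest_and_neighbors(
    query: Fingerprint,
    training: Sequence[Fingerprint],
    cutoff: float = 0.7,  # hypothetical default; the abstract says "user-chosen"
) -> Tuple[float, int]:
    """Return (similarity to nearest training molecule, neighbor count).

    A neighbor is a training molecule more similar to the query than `cutoff`.
    """
    sims = [tanimoto(query, fp) for fp in training]
    if not sims:
        return 0.0, 0
    nearest = max(sims)
    neighbors = sum(1 for s in sims if s > cutoff)
    return nearest, neighbors


if __name__ == "__main__":
    training_set = [
        frozenset({1, 2, 3, 4}),
        frozenset({1, 2, 3}),
        frozenset({7, 8}),
    ]
    query_fp = frozenset({1, 2, 3, 4})
    print(nearest_and_neighbors(query_fp, training_set, cutoff=0.7))
```

Following the abstract, a query with a high nearest-neighbor similarity and/or many neighbors would be expected to be better predicted; how those two numbers map to an expected error is something the paper estimates from cross-validation and is not reproduced here.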

@article{Sheridan:2004aa,
	Abstract = {How well can a QSAR model predict the activity of a molecule not in
	the training set used to create the model? A set of retrospective
	cross-validation experiments using 20 diverse in-house activity sets
	were done to find a good discriminator of prediction accuracy as
	measured by root-mean-square difference between observed and predicted
	activity. Among the measures we tested, two seem useful: the similarity
	of the molecule to be predicted to the nearest molecule in the training
	set and/or the number of neighbors in the training set, where neighbors
	are those more similar than a user-chosen cutoff. The molecules with
	the highest similarity and/or the most neighbors are the best-predicted.
	This trend holds true for narrow training sets and, to a lesser degree,
	for many diverse training sets and does not depend on which QSAR
	method or descriptor is used. One may define the similarity using
	a different descriptor than that used for the QSAR model. The similarity
	dependence for diverse training sets is somewhat unexpected. It appears
	to be greater for those data sets where the association of similar
	activities vs similar structures (as encoded in the Patterson plot)
	is stronger. We propose a way to estimate the reliability of the
	prediction of an arbitrary chemical structure on a given QSAR model,
	given the training set from which the model was derived.},
	Author = {Sheridan, R.P. and Feuston, B.P. and Maiorov, V.N. and Kearsley, S.K.},
	Date-Added = {2007-12-11 17:01:03 -0500},
	Date-Modified = {2009-04-16 15:15:10 -0400},
	Doi = {10.1021/ci049782w},
	Journal = {J.~Chem.~Inf.~Comput.~Sci.},
	Keywords = {domain applicability; qsar},
	Number = {6},
	Owner = {rajarshi},
	Pages = {1912--1928},
	Pmid = {15554660},
	Timestamp = {2007.04.16},
	Title = {Similarity to Molecules in the Training Set Is a Good Discriminator for Prediction Accuracy in QSAR},
	Url = {http://dx.doi.org/10.1021/ci049782w},
	Volume = {44},
	Year = {2004},
	Bdsk-Url-1 = {http://dx.doi.org/10.1021/ci049782w}}
