Journal of Chemical Information and Modeling, 54(11):3056-3066, 11, 2014.
Compound activity data are growing at unprecedented rates, and their complexity is increasing. This challenges compound data mining efforts and makes it difficult to draw reliable conclusions from data analysis. We investigated the influence of individual parameters and data confidence levels on compound selection and property assessment. To this end, alternative sets of bioactive compounds were systematically extracted from ChEMBL on the basis of iteratively expanded selection criteria of increasing stringency, covering a variety of search parameters. The sequential application of criteria for the selection of high-confidence compound data was order-independent, as expected. Furthermore, the influence of separately applied selection criteria was analyzed, and criteria that strongly influenced compound selection and compound promiscuity rates were identified. Under stringent selection criteria and high data confidence, many compounds with likely assay artifacts or liabilities were eliminated from further consideration. Taken together, the findings of our analysis emphasize the need to carefully consider search parameters related to target organisms, confidence level of activity, and activity measurements, and they suggest reliable protocols for compound data mining.
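The order-independence noted in the abstract follows from each criterion being a per-record test, so that applying several in sequence amounts to a set intersection. The sketch below illustrates this under stated assumptions: the records, field names, thresholds, and helper functions (`apply_in_order`, `apply_separately`) are hypothetical and do not reproduce the ChEMBL schema or the criteria actually used in the paper.

```python
from itertools import permutations

# Hypothetical activity records; fields and values are illustrative only,
# not the ChEMBL schema.
records = [
    {"compound": "C1", "organism": "Homo sapiens", "confidence": 9, "measure": "Ki"},
    {"compound": "C2", "organism": "Rattus norvegicus", "confidence": 9, "measure": "Ki"},
    {"compound": "C3", "organism": "Homo sapiens", "confidence": 4, "measure": "IC50"},
    {"compound": "C4", "organism": "Homo sapiens", "confidence": 9, "measure": "Potency"},
    {"compound": "C5", "organism": "Homo sapiens", "confidence": 9, "measure": "IC50"},
]

# Assumed selection criteria, each a test on a single record.
criteria = {
    "human_target": lambda r: r["organism"] == "Homo sapiens",
    "high_confidence": lambda r: r["confidence"] >= 9,
    "defined_measure": lambda r: r["measure"] in {"Ki", "IC50"},
}


def apply_in_order(rows, names):
    """Apply the named criteria one after another; return surviving compound IDs."""
    for name in names:
        rows = [r for r in rows if criteria[name](r)]
    return {r["compound"] for r in rows}


def apply_separately(rows):
    """Apply each criterion on its own to see how much it alone removes."""
    return {
        name: {r["compound"] for r in rows if test(r)}
        for name, test in criteria.items()
    }


# Every ordering of the criteria should yield the same final compound set.
results = {order: apply_in_order(records, order) for order in permutations(criteria)}
final_sets = {frozenset(selected) for selected in results.values()}
separate = apply_separately(records)
```

Comparing the per-criterion sets in `separate` with the full record list gives a rough view of which single criterion removes the most compounds, in the spirit of the separate-criterion analysis described in the abstract.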