Variable Selection in Data Mining. Foster, D. P. & Stine, R. A. JASA, 99(466):303-313, Taylor & Francis, June, 2004.
We predict the onset of personal bankruptcy using least squares regression. Although well publicized, only 2,244 bankruptcies occur in our dataset of 2.9 million months of credit-card activity. We use stepwise selection to find predictors of these from a mix of payment history, debt load, demographics, and their interactions. This combination of rare responses and over 67,000 possible predictors leads to a challenging modeling question: How does one separate coincidental from useful predictors? We show that three modifications turn stepwise regression into an effective methodology for predicting bankruptcy. Our version of stepwise regression (1) organizes calculations to accommodate interactions, (2) exploits modern decision theoretic criteria to choose predictors, and (3) conservatively estimates p-values to handle sparse data and a binary response. Omitting any one of these leads to poor performance. A final step in our procedure calibrates regression predictions. With these modifications, stepwise regression predicts bankruptcy as well as, if not better than, recently developed data-mining tools. When sorted, the largest 14,000 resulting predictions hold 1,000 of the 1,800 bankruptcies hidden in a validation sample of 2.3 million observations. If the cost of missing a bankruptcy is 200 times that of a false positive, our predictions incur less than 2/3 of the costs of classification errors produced by the tree-based classifier C4.5.
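The core of the method described above is greedy forward selection where a candidate predictor enters only if its conservatively estimated p-value survives a multiplicity correction. The following is a minimal stdlib-only sketch of that idea, not the authors' implementation: it uses a simple Bonferroni gate over the candidate pool and a stagewise residual update; all function and variable names are my own, and the z-statistic is a normal approximation to the entry t-statistic.

```python
import math
import random

def _sd(v):
    """Population standard deviation of a list of floats."""
    mu = sum(v) / len(v)
    return math.sqrt(sum((a - mu) ** 2 for a in v) / len(v))

def forward_stepwise(X, y, alpha=0.05):
    """Greedy forward selection with a Bonferroni-adjusted p-value gate.

    X: list of m predictor columns (each a list of n floats)
    y: response, a list of n floats
    Returns the indices of selected predictors, in order of entry.
    (A hypothetical sketch of conservative stepwise selection, not the
    exact criterion of the paper.)
    """
    m = len(X)
    resid = list(y)
    selected = []
    while True:
        best_j, best_z = None, 0.0
        for j in range(m):
            if j in selected:
                continue
            x = X[j]
            sx = math.sqrt(sum(v * v for v in x))
            sr = _sd(resid)
            if sx == 0 or sr == 0:
                continue
            # approximate z-statistic for adding predictor j
            z = sum(a * b for a, b in zip(x, resid)) / (sx * sr)
            if abs(z) > abs(best_z):
                best_j, best_z = j, z
        if best_j is None:
            break
        # two-sided normal p-value, Bonferroni-corrected for m candidates
        p = math.erfc(abs(best_z) / math.sqrt(2))
        if p * m > alpha:
            break  # conservative gate: best candidate is not convincing
        selected.append(best_j)
        # stagewise update: strip the chosen predictor from the residual
        x = X[best_j]
        beta = sum(a * b for a, b in zip(x, resid)) / sum(v * v for v in x)
        resid = [r - beta * v for r, v in zip(resid, x)]
    return selected

# Demo on synthetic data: y depends on predictors 0 and 3 only.
random.seed(7)
n, m = 200, 10
X = [[random.gauss(0, 1) for _ in range(n)] for _ in range(m)]
y = [2.0 * X[0][i] - 1.5 * X[3][i] + random.gauss(0, 1) for i in range(n)]
sel = forward_stepwise(X, y)
```

With many weak or spurious candidates, the Bonferroni factor `m` is what keeps coincidental predictors out, which is the role the conservative p-value estimates play in the paper's version of stepwise regression.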
@article{fos04var,
  title = {Variable {{Selection}} in {{Data Mining}}},
  volume = {99},
  issn = {0162-1459},
  abstract = {We predict the onset of personal bankruptcy using least squares regression. Although well publicized, only 2,244 bankruptcies occur in our dataset of 2.9 million months of credit-card activity. We use stepwise selection to find predictors of these from a mix of payment history, debt load, demographics, and their interactions. This combination of rare responses and over 67,000 possible predictors leads to a challenging modeling question: How does one separate coincidental from useful predictors? We show that three modifications turn stepwise regression into an effective methodology for predicting bankruptcy. Our version of stepwise regression (1) organizes calculations to accommodate interactions, (2) exploits modern decision theoretic criteria to choose predictors, and (3) conservatively estimates p-values to handle sparse data and a binary response. Omitting any one of these leads to poor performance. A final step in our procedure calibrates regression predictions. With these modifications, stepwise regression predicts bankruptcy as well as, if not better than, recently developed data-mining tools. When sorted, the largest 14,000 resulting predictions hold 1,000 of the 1,800 bankruptcies hidden in a validation sample of 2.3 million observations. If the cost of missing a bankruptcy is 200 times that of a false positive, our predictions incur less than 2/3 of the costs of classification errors produced by the tree-based classifier C4.5.},
  number = {466},
  journal = {JASA},
  doi = {10.1198/016214504000000287},
  author = {Foster, Dean P. and Stine, Robert A.},
  month = jun,
  year = {2004},
  keywords = {shrinkage,variable-selection},
  pages = {303--313},
  publisher = {Taylor \& Francis},
  day = {1},
  annote = {promising recalibration of naive stepwise variable selection in a high-dimensional context}
}
