Significance Tests for Bizarre Measures in 2-Class Classification Tasks. Keller, M.; Mariéthoz, J.; and Bengio, S. Technical Report 04-34, IDIAP, 2004.
Statistical significance tests are often used in machine learning to compare the performance of two learning algorithms or two models. However, in most cases, one of the underlying assumptions behind these tests is that the error measure used to assess the performance of one model/algorithm is computed as the sum of errors obtained on each example of the test set. This is however not the case for several well-known measures such as $F_1$, used in text categorization, or DCF, used in person authentication. We propose here a practical methodology to either adapt the existing tests or develop non-parametric solutions for such bizarre measures. We furthermore assess the quality of these tests on a real-life large dataset.
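The abstract's central point, that $F_1$ is not a sum of per-example errors and so calls for a non-parametric test, can be illustrated with a generic paired percentile bootstrap. This is only a minimal sketch of one common non-parametric approach, not the procedure from the report, which may differ. The function names `f1_score` and `bootstrap_f1_difference`, the resample count, and the confidence level are all assumptions made for illustration.

```python
import random


def f1_score(y_true, y_pred):
    """F1 for binary labels (1 = positive). Not a sum of per-example errors."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_f1_difference(y_true, pred_a, pred_b,
                            n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for F1(A) - F1(B) on a shared test set.

    Hypothetical helper: test examples are resampled with replacement and
    both models are scored on the same resample, so the comparison is paired.
    """
    rng = random.Random(seed)
    n = len(y_true)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        yt = [y_true[i] for i in idx]
        pa = [pred_a[i] for i in idx]
        pb = [pred_b[i] for i in idx]
        diffs.append(f1_score(yt, pa) - f1_score(yt, pb))
    diffs.sort()
    # Clamp the percentile indices so they stay inside the list.
    lo_idx = max(0, int((alpha / 2) * n_resamples))
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples) - 1)
    observed = f1_score(y_true, pred_a) - f1_score(y_true, pred_b)
    return observed, diffs[lo_idx], diffs[hi_idx]
```

Under these assumptions, an interval that excludes zero would suggest that the two models differ in $F_1$ at roughly the chosen level; the interval is an approximation whose quality depends on test-set size and the number of resamples.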
@techreport{keller:2004:idiap:04-34,
  author = {M. Keller and J. Mari{\'e}thoz and S. Bengio},
  title = {Significance Tests for Bizarre Measures in 2-Class Classification Tasks},
  institution = {IDIAP},
  year = 2004,
  type = {Technical Report IDIAP-RR},
  number =   {04-34},
  url = {publications/ps/rr04-34.ps.gz},
  pdf = {publications/pdf/rr04-34.pdf},
  djvu = {publications/djvu/rr04-34.djvu},
  original = {2004/stat_tests_nips_rejected},
  topics = {biometric_authentication},
  abstract = {Statistical significance tests are often used in machine learning to compare the performance of two learning algorithms or two models. However, in most cases, one of the underlying assumptions behind these tests is that the error measure used to assess the performance of one model/algorithm is computed as the sum of errors obtained on each example of the test set. This is however not the case for several well-known measures such as $F_1$, used in text categorization, or DCF, used in person authentication. We propose here a practical methodology to either adapt the existing tests or develop non-parametric solutions for such {\em bizarre} measures.  We furthermore assess the quality of these tests on a real-life large dataset.},
  categorie = {E},
}