Significance Tests for Bizarre Measures in 2-Class Classification Tasks

Significance Tests for Bizarre Measures in 2-Class Classification Tasks. Keller, M., Mariéthoz, J., & Bengio, S. Technical Report 04-34, IDIAP, 2004.


Statistical significance tests are often used in machine learning to compare the performance of two learning algorithms or two models. However, in most cases, one of the underlying assumptions behind these tests is that the error measure used to assess the performance of one model/algorithm is computed as the sum of errors obtained on each example of the test set. This is however not the case for several well-known measures such as $F_1$, used in text categorization, or DCF, used in person authentication. We propose here a practical methodology to either adapt the existing tests or develop non-parametric solutions for such "bizarre" measures. We furthermore assess the quality of these tests on a real-life large dataset.
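
To make the abstract's point concrete: $F_1$ is a ratio of counts over the whole test set, so it cannot be written as a sum of per-example errors, and tests that assume such a sum do not apply directly. One generic non-parametric alternative is a paired bootstrap over test examples. The sketch below is an illustration under assumptions, not necessarily the procedure used in the report: the function names, the percentile interval, the number of resamples, and the convention of returning 0.0 when $F_1$ is undefined are all choices made here for the example.

```python
import random


def f1_score(y_true, y_pred):
    """F1 for the positive class (label 1); 0.0 when undefined (assumed convention)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def paired_bootstrap_f1(y_true, pred_a, pred_b,
                        n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap interval for F1(A) - F1(B) on a shared test set.

    Returns (observed difference, (lower, upper), significant), where
    `significant` is True when the interval excludes zero.
    """
    rng = random.Random(seed)
    n = len(y_true)
    diffs = []
    for _ in range(n_resamples):
        # Resample example indices with replacement; both models are
        # evaluated on the same resample, which keeps the comparison paired.
        idx = [rng.randrange(n) for _ in range(n)]
        yt = [y_true[i] for i in idx]
        a = [pred_a[i] for i in idx]
        b = [pred_b[i] for i in idx]
        diffs.append(f1_score(yt, a) - f1_score(yt, b))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    observed = f1_score(y_true, pred_a) - f1_score(y_true, pred_b)
    significant = not (lower <= 0.0 <= upper)
    return observed, (lower, upper), significant
```

Resampling whole examples and recomputing the measure on each resample avoids any assumption that the measure decomposes over examples, which is why the same sketch would work for other aggregate measures by swapping out `f1_score`.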

@techreport{keller:2004:idiap:04-34,
  author = {M. Keller and J. Mari{\'e}thoz and S. Bengio},
  title = {Significance Tests for Bizarre Measures in 2-Class Classification Tasks},
  institution = {IDIAP},
  year = 2004,
  type = {Technical Report IDIAP-RR},
  number =   {04-34},
  url = {publications/ps/rr04-34.ps.gz},
  pdf = {publications/pdf/rr04-34.pdf},
  djvu = {publications/djvu/rr04-34.djvu},
  original = {2004/stat_tests_nips_rejected},
  topics = {biometric_authentication},
  abstract = {Statistical significance tests are often used in machine learning to compare the performance of two learning algorithms or two models. However, in most cases, one of the underlying assumptions behind these tests is that the error measure used to assess the performance of one model/algorithm is computed as the sum of errors obtained on each example of the test set. This is however not the case for several well-known measures such as $F_1$, used in text categorization, or DCF, used in person authentication. We propose here a practical methodology to either adapt the existing tests or develop non-parametric solutions for such {\em bizarre} measures.  We furthermore assess the quality of these tests on a real-life large dataset.},
  categorie = {E},
}

