Training Highly Multiclass Classifiers. Gupta, M. R., Bengio, S., & Weston, J. Journal of Machine Learning Research (JMLR), 15:1461–1492, 2014.
Classification problems with thousands or more classes often have a large variance in the confusability between classes, and we show that the more-confusable classes add more noise to the empirical loss that is minimized during training. We propose an online solution that reduces the effect of highly confusable classes in training the classifier parameters, and focuses the training on pairs of classes that are easier to differentiate at any given time in the training. We also show that the adagrad method, recently proposed for automatically decreasing step sizes for convex stochastic gradient descent optimization, can also be profitably applied to the nonconvex optimization stochastic gradient descent training of a joint supervised dimensionality reduction and linear classifier. Experiments on ImageNet benchmark datasets and proprietary image recognition problems with 15,000 to 97,000 classes show substantial gains in classification accuracy compared to one-vs-all linear SVMs and Wsabie.
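Below is a minimal sketch (not the authors' released code) of the training setup the abstract describes: Adagrad per-coordinate step sizes applied to pairwise stochastic gradient descent over a joint linear dimensionality reduction W and per-class linear scorers V. All dimensions, the margin, the step size, and the negative-class sampling are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n_classes, n_steps = 1000, 100, 500, 10_000   # assumed sizes, not from the paper
W = 0.01 * rng.standard_normal((k, d))              # shared low-rank projection
V = 0.01 * rng.standard_normal((n_classes, k))      # one scoring vector per class
G_W, G_V = np.zeros_like(W), np.zeros_like(V)       # Adagrad squared-gradient accumulators
eta, eps, margin = 0.1, 1e-8, 1.0                   # assumed hyperparameters

for _ in range(n_steps):
    x = rng.standard_normal(d)                      # stand-in for a real training example
    y = rng.integers(n_classes)                     # stand-in for its label
    z = W @ x                                       # low-dimensional representation
    scores = V @ z                                  # class scores
    # pick a negative class that violates the margin, if any
    violators = np.flatnonzero(scores + margin > scores[y])
    violators = violators[violators != y]
    if violators.size == 0:
        continue
    c = rng.choice(violators)
    # gradients of the pairwise hinge loss  max(0, margin - s_y + s_c)
    g_y, g_c = -z, z
    g_W = np.outer(V[c] - V[y], x)
    # Adagrad: per-coordinate step sizes shrink with accumulated squared gradients
    G_V[y] += g_y ** 2
    G_V[c] += g_c ** 2
    G_W += g_W ** 2
    V[y] -= eta * g_y / (np.sqrt(G_V[y]) + eps)
    V[c] -= eta * g_c / (np.sqrt(G_V[c]) + eps)
    W -= eta * g_W / (np.sqrt(G_W) + eps)
```

Because the scoring function is the composition of the learned projection and the class vectors, the overall objective is nonconvex; the point made in the abstract is that Adagrad's automatically decaying per-coordinate step sizes remain useful in that regime.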
@article{gupta:2014:jmlr,
  author = {M. R. Gupta and S. Bengio and J. Weston},
  title = {Training Highly Multiclass Classifiers},
  journal = {Journal of Machine Learning Research, {JMLR}},
  volume = 15,
  pages = {1461--1492},
  year = 2014,
  web = {http://jmlr.org/papers/volume15/gupta14a/gupta14a.pdf},
  pdf = {publications/pdf/gupta_2014_jmlr.pdf},
  url = {publications/ps/gupta_2014_jmlr.ps.gz},
  djvu = {publications/djvu/gupta_2014_jmlr.djvu},
  topics = {large_scale,ranking},
  original = {2014/jmlr_wsabie},
  abstract = {Classification problems with thousands or more classes often have a large variance in the confusability between classes, and we show that the more-confusable classes add more noise to the empirical loss that is minimized during training. We propose an online solution that reduces the effect of highly confusable classes in training the classifier parameters, and focuses the training on pairs of classes that are easier to differentiate at any given time in the training.  We also show that the adagrad method, recently proposed for automatically decreasing step sizes for convex stochastic gradient descent optimization, can also be profitably applied to the nonconvex optimization stochastic gradient descent training of a joint supervised dimensionality reduction and linear classifier. Experiments on ImageNet benchmark datasets and proprietary image recognition problems with 15,000 to 97,000 classes show substantial gains in classification accuracy compared to one-vs-all linear SVMs and Wsabie.},
  categorie = {A}
}
