Escaping the Gravitational Pull of Softmax

Escaping the Gravitational Pull of Softmax. Mei, J., Xiao, C., Dai, B., Li, L., Szepesvári, C., & Schuurmans, D. In NeurIPS, December 2020.

The softmax is the standard transformation used in machine learning to map real-valued vectors to categorical distributions. Unfortunately, this transform poses serious drawbacks for gradient descent (ascent) optimization. We reveal this difficulty by establishing two negative results: (1) optimizing any expectation with respect to the softmax must exhibit sensitivity to parameter initialization (softmax gravity well), and (2) optimizing log-probabilities under the softmax must exhibit slow convergence (softmax damping). Both findings are based on an analysis of convergence rates using the Non-uniform Łojasiewicz (NŁ) inequalities. To circumvent these shortcomings we investigate an alternative transformation, the escort mapping, that demonstrates better optimization properties. The disadvantages of the softmax and the effectiveness of the escort transformation are further explained using the concept of NŁ coefficient. In addition to proving bounds on convergence rates to firmly establish these results, we also provide experimental evidence for the superiority of the escort transformation.
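
The two transformations contrasted in the abstract can be sketched as follows. This is only an illustration, not the authors' code: the abstract does not state the escort formula, so the form used here, pi(a) = |theta(a)|^p / sum_b |theta(b)|^p, is an assumption, and the function names and the default p = 2 are hypothetical choices.

```python
import math


def softmax(theta):
    """Map real-valued scores to a categorical distribution by exp-normalization."""
    m = max(theta)  # subtract the max for numerical stability
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]


def escort(theta, p=2.0):
    """Assumed escort form: pi(a) = |theta(a)|^p / sum_b |theta(b)|^p."""
    powers = [abs(t) ** p for t in theta]
    z = sum(powers)
    if z == 0.0:
        # The assumed mapping is undefined when every parameter is zero.
        raise ValueError("escort mapping is undefined at theta = 0")
    return [w / z for w in powers]


if __name__ == "__main__":
    theta = [1.0, 2.0, 3.0]
    print("softmax:", softmax(theta))
    print("escort :", escort(theta, p=2.0))
```

Both functions return nonnegative values that sum to one. Under the assumed form, the escort output depends polynomially rather than exponentially on the parameters.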

@inproceedings{MXDLSzS20,
	abstract = {The softmax is the standard transformation used in machine learning to map real-valued vectors to categorical distributions. Unfortunately, this transform poses serious drawbacks for gradient descent (ascent) optimization. We reveal this difficulty by establishing two negative results: (1) optimizing any expectation with respect to the softmax must exhibit sensitivity to parameter initialization (<code>softmax gravity well</code>), and (2) optimizing log-probabilities under the softmax must exhibit slow convergence (<code>softmax damping</code>). Both findings are based on an analysis of convergence rates using the Non-uniform \L{}ojasiewicz (N\L{}) inequalities. To circumvent these shortcomings we investigate an alternative transformation, the <em>escort</em> mapping, that demonstrates better optimization properties. The disadvantages of the softmax and the effectiveness of the escort transformation are further explained using the concept of N\L{} coefficient. In addition to proving bounds on convergence rates to firmly establish these results, we also provide experimental evidence for the superiority of the escort transformation.},
	author = {Mei, J. and Xiao, C. and Dai, B. and Li, L. and Szepesv{\'a}ri, Cs. and Schuurmans, D.},
	crossref = {NeurIPS2020oral},
	month = {12},
	title = {Escaping the Gravitational Pull of Softmax},
	url_url = {https://papers.nips.cc/paper/2020/hash/f1cf2a082126bf02de0b307778ce73a7-Abstract.html},
	url_paper = {NeurIPS2020_pg.pdf},
	booktitle = {NeurIPS},
	year = {2020}
}
