Small ReLU networks are powerful memorizers: a tight analysis of memorization capacity. Yun, C., Sra, S., & Jadbabaie, A. arXiv, 2018.
We study finite sample expressivity, i.e., memorization power of ReLU networks. Recent results require $N$ hidden nodes to memorize/interpolate arbitrary $N$ data points. In contrast, by exploiting depth, we show that 3-layer ReLU networks with $\Omega(\sqrt{N})$ hidden nodes can perfectly memorize most datasets with $N$ points. We also prove that width $\Theta(\sqrt{N})$ is necessary and sufficient for memorizing $N$ data points, proving tight bounds on memorization capacity. The sufficiency result can be extended to deeper networks; we show that an $L$-layer network with $W$ parameters in the hidden layers can memorize $N$ data points if $W = \Omega(N)$. Combined with a recent upper bound $O(WL\log W)$ on VC dimension, our construction is nearly tight for any fixed $L$. Subsequently, we analyze memorization capacity of residual networks under a general position assumption; we prove results that substantially reduce the known requirement of $N$ hidden nodes. Finally, we study the dynamics of stochastic gradient descent (SGD), and show that when initialized near a memorizing global minimum of the empirical risk, SGD quickly finds a nearby point with much smaller empirical risk.
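A rough parameter count (our illustration, not an argument taken from the paper) shows why the width bound and the parameter bound agree. With two hidden layers of width $d_1 = d_2 = c\sqrt{N}$ and the input dimension $d_x$ treated as constant, the hidden-layer parameter count is

$W = d_x d_1 + d_1 d_2 + d_1 + d_2 = \Theta(N),$

so a width-$\Theta(\sqrt{N})$ construction sits exactly at the $W = \Omega(N)$ sufficiency threshold, and the VC-dimension upper bound $O(WL\log W)$ rules out improving it by more than logarithmic factors for any fixed depth $L$.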
@Article{Yun2018,
  author        = {Yun, Chulhee and Sra, Suvrit and Jadbabaie, Ali},
  title         = {Small {ReLU} networks are powerful memorizers: a tight analysis of memorization capacity},
  journal       = {arXiv},
  eprint        = {1810.07770v3},
  archivePrefix = {arXiv},
  year          = {2018},
  abstract      = {We study finite sample expressivity, i.e., memorization power of ReLU networks. Recent results require $N$ hidden nodes to memorize/interpolate arbitrary $N$ data points. In contrast, by exploiting depth, we show that 3-layer ReLU networks with $\Omega(\sqrt{N})$ hidden nodes can perfectly memorize most datasets with $N$ points. We also prove that width $\Theta(\sqrt{N})$ is necessary and sufficient for memorizing $N$ data points, proving tight bounds on memorization capacity. The sufficiency result can be extended to deeper networks; we show that an $L$-layer network with $W$ parameters in the hidden layers can memorize $N$ data points if $W = \Omega(N)$. Combined with a recent upper bound $O(WL\log W)$ on VC dimension, our construction is nearly tight for any fixed $L$. Subsequently, we analyze memorization capacity of residual networks under a general position assumption; we prove results that substantially reduce the known requirement of $N$ hidden nodes. Finally, we study the dynamics of stochastic gradient descent (SGD), and show that when initialized near a memorizing global minimum of the empirical risk, SGD quickly finds a nearby point with much smaller empirical risk.}
}