A Convergence Theory for Deep Learning via Over-Parameterization. Allen-Zhu, Z., Li, Y., & Song, Z. June 2019. arXiv:1811.03962
Abstract: Deep neural networks (DNNs) have demonstrated dominant performance in many fields; since AlexNet, networks used in practice have grown wider and deeper. On the theoretical side, a long line of work has focused on training neural networks with one hidden layer; the theory of multi-layer networks remains largely unsettled. In this work, we prove why stochastic gradient descent (SGD) can find $\textit{global minima}$ on the training objective of DNNs in $\textit{polynomial time}$. We make only two assumptions: the inputs are non-degenerate and the network is over-parameterized. The latter means the network width is sufficiently large: $\textit{polynomial}$ in $L$, the number of layers, and in $n$, the number of samples. Our key technique is to show that, in a sufficiently large neighborhood of the random initialization, the optimization landscape is almost convex and semi-smooth even with ReLU activations. This implies an equivalence between over-parameterized neural networks and the neural tangent kernel (NTK) in the finite (and polynomial) width setting. As concrete examples, starting from randomly initialized weights, we prove that SGD can attain 100% training accuracy in classification tasks, or minimize the regression loss at a linear convergence rate, with running time polynomial in $n$ and $L$. Our theory applies to the widely used but non-smooth ReLU activation, and to any smooth, possibly non-convex loss function. In terms of network architectures, our theory applies at least to fully-connected networks, convolutional neural networks (CNNs), and residual networks (ResNets).
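For orientation, the guarantee announced in the abstract can be written schematically as follows. This is a hedged paraphrase, not the paper's exact statement: constants, exponents, and the step-size choice are suppressed, and $\delta$ denotes the minimum pairwise separation of the (unit-norm) training inputs, which is the paper's non-degeneracy assumption.

\[
  m \;\ge\; \operatorname{poly}\!\left(n,\ L,\ \tfrac{1}{\delta}\right)
  \quad\Longrightarrow\quad
  F(W_t) \;\le\; (1-\rho)^{t}\, F(W_0) \quad \text{for all } t \ge 0,
\]

where $m$ is the network width, $F$ is the $\ell_2$ regression loss, $W_t$ are the weights after $t$ steps of (S)GD from random initialization, and $\rho \in (0,1)$ is a contraction factor depending polynomially on the step size, $\delta$, $m$, and $1/n$. This is the "linear convergence speed" claimed above: an $\varepsilon$-accurate global minimizer is reached after $t = O(\rho^{-1} \log(F(W_0)/\varepsilon))$ steps, which is polynomial in $n$ and $L$.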
@misc{allen-zhu_convergence_2019,
title = {A {Convergence} {Theory} for {Deep} {Learning} via {Over}-{Parameterization}},
url = {http://arxiv.org/abs/1811.03962},
doi = {10.48550/arXiv.1811.03962},
abstract = {Deep neural networks (DNNs) have demonstrated dominating performance in many fields; since AlexNet, networks used in practice are going wider and deeper. On the theoretical side, a long line of works has been focusing on training neural networks with one hidden layer. The theory of multi-layer networks remains largely unsettled. In this work, we prove why stochastic gradient descent (SGD) can find $\textit{global minima}$ on the training objective of DNNs in $\textit{polynomial time}$. We only make two assumptions: the inputs are non-degenerate and the network is over-parameterized. The latter means the network width is sufficiently large: $\textit{polynomial}$ in $L$, the number of layers, and in $n$, the number of samples. Our key technique is to derive that, in a sufficiently large neighborhood of the random initialization, the optimization landscape is almost-convex and semi-smooth even with ReLU activations. This implies an equivalence between over-parameterized neural networks and neural tangent kernel (NTK) in the finite (and polynomial) width setting. As concrete examples, starting from randomly initialized weights, we prove that SGD can attain 100\% training accuracy in classification tasks, or minimize regression loss in linear convergence speed, with running time polynomial in $n,L$. Our theory applies to the widely-used but non-smooth ReLU activation, and to any smooth and possibly non-convex loss functions. In terms of network architectures, our theory at least applies to fully-connected neural networks, convolutional neural networks (CNN), and residual neural networks (ResNet).},
urldate = {2024-11-21},
publisher = {arXiv},
author = {Allen-Zhu, Zeyuan and Li, Yuanzhi and Song, Zhao},
month = jun,
year = {2019},
note = {arXiv:1811.03962},
keywords = {Computer Science - Data Structures and Algorithms, Computer Science - Machine Learning, Computer Science - Neural and Evolutionary Computing, Mathematics - Optimization and Control, Statistics - Machine Learning},
}
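As a quick empirical companion (not an experiment from the paper), the sketch below trains a very wide two-layer ReLU network with full-batch gradient descent on a small random regression task; in this over-parameterized regime the training loss should decay roughly geometrically, in line with the abstract's claim. The width m, the learning rate, the data sizes, and the NTK-style convention of freezing the output layer are illustrative assumptions, not values or choices taken from the paper.

import numpy as np

# Minimal sketch: gradient descent on a width-m two-layer ReLU network.
# Only the first layer W is trained; the output layer a is frozen at
# +-1/sqrt(m) (an assumed NTK-style convention, for a one-line gradient).
rng = np.random.default_rng(0)
n, d, m = 20, 10, 4096                         # samples, input dim, hidden width (m >> n)
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)  # unit-norm, well-separated inputs
y = rng.standard_normal(n)

W = rng.standard_normal((m, d)) * np.sqrt(2.0 / d)  # random initialization
a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)    # frozen output weights

lr = 0.02
for step in range(1001):
    Z = X @ W.T                   # (n, m) pre-activations
    H = np.maximum(Z, 0.0)        # ReLU
    err = H @ a - y               # residuals f(x_i) - y_i
    if step % 200 == 0:
        print(f"step {step:4d}   loss {0.5 * np.sum(err**2):.3e}")
    # Gradient of F(W) = 0.5 * sum_i (f(x_i) - y_i)^2 with respect to W.
    G = ((err[:, None] * (Z > 0.0)) * a[None, :]).T @ X
    W -= lr * G

The paper's actual setting trains all layers of a deep network; this toy only makes the "loss decays geometrically once the width is large" phenomenon tangible.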
{"_id":"696Rp3XyJS5PXf4Jz","bibbaseid":"allenzhu-li-song-aconvergencetheoryfordeeplearningviaoverparameterization-2019","author_short":["Allen-Zhu, Z.","Li, Y.","Song, Z."],"bibdata":{"bibtype":"misc","type":"misc","title":"A Convergence Theory for Deep Learning via Over-Parameterization","url":"http://arxiv.org/abs/1811.03962","doi":"10.48550/arXiv.1811.03962","abstract":"Deep neural networks (DNNs) have demonstrated dominating performance in many fields; since AlexNet, networks used in practice are going wider and deeper. On the theoretical side, a long line of works has been focusing on training neural networks with one hidden layer. The theory of multi-layer networks remains largely unsettled. In this work, we prove why stochastic gradient descent (SGD) can find ${\\}textit\\{global minima\\}$ on the training objective of DNNs in ${\\}textit\\{polynomial time\\}$. We only make two assumptions: the inputs are non-degenerate and the network is over-parameterized. The latter means the network width is sufficiently large: ${\\}textit\\{polynomial\\}$ in $L$, the number of layers and in $n$, the number of samples. Our key technique is to derive that, in a sufficiently large neighborhood of the random initialization, the optimization landscape is almost-convex and semi-smooth even with ReLU activations. This implies an equivalence between over-parameterized neural networks and neural tangent kernel (NTK) in the finite (and polynomial) width setting. As concrete examples, starting from randomly initialized weights, we prove that SGD can attain 100% training accuracy in classification tasks, or minimize regression loss in linear convergence speed, with running time polynomial in $n,L$. Our theory applies to the widely-used but non-smooth ReLU activation, and to any smooth and possibly non-convex loss functions. In terms of network architectures, our theory at least applies to fully-connected neural networks, convolutional neural networks (CNN), and residual neural networks (ResNet).","urldate":"2024-11-21","publisher":"arXiv","author":[{"propositions":[],"lastnames":["Allen-Zhu"],"firstnames":["Zeyuan"],"suffixes":[]},{"propositions":[],"lastnames":["Li"],"firstnames":["Yuanzhi"],"suffixes":[]},{"propositions":[],"lastnames":["Song"],"firstnames":["Zhao"],"suffixes":[]}],"month":"June","year":"2019","note":"arXiv:1811.03962","keywords":"Computer Science - Data Structures and Algorithms, Computer Science - Machine Learning, Computer Science - Neural and Evolutionary Computing, Mathematics - Optimization and Control, Statistics - Machine Learning","bibtex":"@misc{allen-zhu_convergence_2019,\n\ttitle = {A {Convergence} {Theory} for {Deep} {Learning} via {Over}-{Parameterization}},\n\turl = {http://arxiv.org/abs/1811.03962},\n\tdoi = {10.48550/arXiv.1811.03962},\n\tabstract = {Deep neural networks (DNNs) have demonstrated dominating performance in many fields; since AlexNet, networks used in practice are going wider and deeper. On the theoretical side, a long line of works has been focusing on training neural networks with one hidden layer. The theory of multi-layer networks remains largely unsettled. In this work, we prove why stochastic gradient descent (SGD) can find \\${\\textbackslash}textit\\{global minima\\}\\$ on the training objective of DNNs in \\${\\textbackslash}textit\\{polynomial time\\}\\$. We only make two assumptions: the inputs are non-degenerate and the network is over-parameterized. 
The latter means the network width is sufficiently large: \\${\\textbackslash}textit\\{polynomial\\}\\$ in \\$L\\$, the number of layers and in \\$n\\$, the number of samples. Our key technique is to derive that, in a sufficiently large neighborhood of the random initialization, the optimization landscape is almost-convex and semi-smooth even with ReLU activations. This implies an equivalence between over-parameterized neural networks and neural tangent kernel (NTK) in the finite (and polynomial) width setting. As concrete examples, starting from randomly initialized weights, we prove that SGD can attain 100\\% training accuracy in classification tasks, or minimize regression loss in linear convergence speed, with running time polynomial in \\$n,L\\$. Our theory applies to the widely-used but non-smooth ReLU activation, and to any smooth and possibly non-convex loss functions. In terms of network architectures, our theory at least applies to fully-connected neural networks, convolutional neural networks (CNN), and residual neural networks (ResNet).},\n\turldate = {2024-11-21},\n\tpublisher = {arXiv},\n\tauthor = {Allen-Zhu, Zeyuan and Li, Yuanzhi and Song, Zhao},\n\tmonth = jun,\n\tyear = {2019},\n\tnote = {arXiv:1811.03962},\n\tkeywords = {Computer Science - Data Structures and Algorithms, Computer Science - Machine Learning, Computer Science - Neural and Evolutionary Computing, Mathematics - Optimization and Control, Statistics - Machine Learning},\n}\n\n\n\n\n\n\n\n\n\n\n\n","author_short":["Allen-Zhu, Z.","Li, Y.","Song, Z."],"key":"allen-zhu_convergence_2019","id":"allen-zhu_convergence_2019","bibbaseid":"allenzhu-li-song-aconvergencetheoryfordeeplearningviaoverparameterization-2019","role":"author","urls":{"Paper":"http://arxiv.org/abs/1811.03962"},"keyword":["Computer Science - Data Structures and Algorithms","Computer Science - Machine Learning","Computer Science - Neural and Evolutionary Computing","Mathematics - Optimization and Control","Statistics - Machine Learning"],"metadata":{"authorlinks":{}},"html":""},"bibtype":"misc","biburl":"https://bibbase.org/zotero/pa511","dataSources":["PMupqmDbRNFndeT5W","MpmemwLeQzDcKDq6x"],"keywords":["computer science - data structures and algorithms","computer science - machine learning","computer science - neural and evolutionary computing","mathematics - optimization and control","statistics - machine learning"],"search_terms":["convergence","theory","deep","learning","via","over","parameterization","allen-zhu","li","song"],"title":"A Convergence Theory for Deep Learning via Over-Parameterization","year":2019}