A Convergence Theory for Deep Learning via Over-Parameterization. Allen-Zhu, Z., Li, Y., & Song, Z. June 2019. arXiv:1811.03962
Deep neural networks (DNNs) have demonstrated dominating performance in many fields; since AlexNet, networks used in practice are going wider and deeper. On the theoretical side, a long line of works has been focusing on training neural networks with one hidden layer. The theory of multi-layer networks remains largely unsettled. In this work, we prove why stochastic gradient descent (SGD) can find $\textit{global minima}$ on the training objective of DNNs in $\textit{polynomial time}$. We only make two assumptions: the inputs are non-degenerate and the network is over-parameterized. The latter means the network width is sufficiently large: $\textit{polynomial}$ in $L$, the number of layers, and in $n$, the number of samples. Our key technique is to derive that, in a sufficiently large neighborhood of the random initialization, the optimization landscape is almost-convex and semi-smooth even with ReLU activations. This implies an equivalence between over-parameterized neural networks and neural tangent kernel (NTK) in the finite (and polynomial) width setting. As concrete examples, starting from randomly initialized weights, we prove that SGD can attain 100% training accuracy in classification tasks, or minimize regression loss in linear convergence speed, with running time polynomial in $n,L$. Our theory applies to the widely-used but non-smooth ReLU activation, and to any smooth and possibly non-convex loss functions. In terms of network architectures, our theory at least applies to fully-connected neural networks, convolutional neural networks (CNN), and residual neural networks (ResNet).
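As a rough illustration of the over-parameterized regime the abstract describes (not the paper's multi-layer construction or its proof), the sketch below trains a single wide ReLU layer with SGD on squared loss over a few random unit-norm inputs with random ±1 labels. The width m, learning rate, fixed output weights, and dataset sizes are illustrative assumptions; with m much larger than n, SGD from random initialization typically drives the training loss near zero and training accuracy to 100%.

# Toy illustration (not the paper's construction): a wide one-hidden-layer
# ReLU network with fixed random output weights, trained by SGD on squared loss.
# With hidden width m >> number of samples n, SGD fits random binary labels
# to 100% training accuracy starting from a random initialization.
import numpy as np

rng = np.random.default_rng(0)

n, d, m = 20, 10, 2000                                  # samples, input dim, hidden width (m >> n)
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)           # unit-norm, non-degenerate inputs
y = rng.choice([-1.0, 1.0], size=n)                     # random binary labels

W = rng.normal(scale=np.sqrt(2.0 / d), size=(m, d))     # trained hidden-layer weights
a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)        # fixed output weights

def forward(Xb):
    H = np.maximum(Xb @ W.T, 0.0)                       # ReLU features, shape (batch, m)
    return H @ a                                        # predictions

lr, epochs, batch = 0.05, 500, 4
for epoch in range(epochs):
    perm = rng.permutation(n)
    for start in range(0, n, batch):
        idx = perm[start:start + batch]
        Xb, yb = X[idx], y[idx]
        err = forward(Xb) - yb                          # squared-loss residual
        mask = (Xb @ W.T > 0.0).astype(float)           # ReLU gates per sample
        grad_W = (err[:, None] * mask * a[None, :]).T @ Xb
        W -= lr * grad_W                                # SGD step on hidden weights

pred = forward(X)
acc = np.mean(np.sign(pred) == y)
print(f"training accuracy: {acc:.2f}, squared loss: {0.5 * np.sum((pred - y) ** 2):.4f}")

Shrinking m toward n (or below) in this sketch makes the fit unreliable, which is the intuition behind the width requirement in the theorem; the paper's result is the much stronger statement that polynomial width suffices for deep networks with provable convergence rates.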
@misc{allen-zhu_convergence_2019,
	title = {A {Convergence} {Theory} for {Deep} {Learning} via {Over}-{Parameterization}},
	url = {http://arxiv.org/abs/1811.03962},
	doi = {10.48550/arXiv.1811.03962},
	abstract = {Deep neural networks (DNNs) have demonstrated dominating performance in many fields; since AlexNet, networks used in practice are going wider and deeper. On the theoretical side, a long line of works has been focusing on training neural networks with one hidden layer. The theory of multi-layer networks remains largely unsettled. In this work, we prove why stochastic gradient descent (SGD) can find \textit{global minima} on the training objective of DNNs in \textit{polynomial time}. We only make two assumptions: the inputs are non-degenerate and the network is over-parameterized. The latter means the network width is sufficiently large: \textit{polynomial} in $L$, the number of layers and in $n$, the number of samples. Our key technique is to derive that, in a sufficiently large neighborhood of the random initialization, the optimization landscape is almost-convex and semi-smooth even with ReLU activations. This implies an equivalence between over-parameterized neural networks and neural tangent kernel (NTK) in the finite (and polynomial) width setting. As concrete examples, starting from randomly initialized weights, we prove that SGD can attain 100\% training accuracy in classification tasks, or minimize regression loss in linear convergence speed, with running time polynomial in $n,L$. Our theory applies to the widely-used but non-smooth ReLU activation, and to any smooth and possibly non-convex loss functions. In terms of network architectures, our theory at least applies to fully-connected neural networks, convolutional neural networks (CNN), and residual neural networks (ResNet).},
	urldate = {2024-11-21},
	publisher = {arXiv},
	author = {Allen-Zhu, Zeyuan and Li, Yuanzhi and Song, Zhao},
	month = jun,
	year = {2019},
	note = {arXiv:1811.03962},
	keywords = {Computer Science - Data Structures and Algorithms, Computer Science - Machine Learning, Computer Science - Neural and Evolutionary Computing, Mathematics - Optimization and Control, Statistics - Machine Learning},
}
