Tutorial on Variational Autoencoders. Doersch, C. ArXiv e-prints, June, 2016.
@article{Doersch:2016vb,
author = {Doersch, Carl},
title = {{Tutorial on Variational Autoencoders}},
journal = {ArXiv e-prints},
year = {2016},
volume = {stat.ML},
month = jun,
annote = {A great tutorial on VAEs, explaining them, as well as generative models in general, in plain words.
pp. 6-7: why we can't simply estimate P(X) and then compute its gradient.
Essentially, it's very difficult to get an accurate estimate of P(X). Without many samples, our estimate of P(X) would be very bad, due to the use of the l2 norm to measure reconstruction error.
Section 2.1
pp. 7
Instead of sampling the hidden variable randomly, with no connection to the training data, we should try to sample from its posterior.
> The key idea behind the variational autoencoder is to attempt to sample values of z that are likely to have produced X, and compute P(X) just from those. This means that we need a new function Q(z|X) which can take a value of X and give us a distribution over z values that are likely to produce X.
This is exactly right. In the classical EM algorithm (e.g. for a GMM), if we know the posterior, we actually get a tight (exact) bound on the likelihood.
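This tightness is easy to check numerically. Below is a minimal numpy sketch (the toy model, the numbers, and all variable names are my own illustration, not from the paper): with Q set to the exact posterior, every single sample of log p(X|z) + log p(z) - log q(z) equals log p(X), so the bound is tight with zero variance.

```python
import numpy as np

def log_gauss(x, mu, var):
    # log-density of N(mu, var) evaluated at x
    return -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

# Toy model: z ~ N(0, 1), X | z ~ N(z, s2).  Marginal: X ~ N(0, 1 + s2).
s2, x = 0.5, 1.3
true_logpx = log_gauss(x, 0.0, 1.0 + s2)

# Exact posterior p(z|X) is N(x / (1 + s2), s2 / (1 + s2)).
post_mean, post_var = x / (1 + s2), s2 / (1 + s2)
rng = np.random.default_rng(0)
z = post_mean + np.sqrt(post_var) * rng.standard_normal(5)

# With Q = exact posterior, each sample of the ELBO integrand
# log p(X|z) + log p(z) - log q(z) equals log p(X) pointwise.
per_sample = (log_gauss(x, z, s2) + log_gauss(z, 0.0, 1.0)
              - log_gauss(z, post_mean, post_var))
print(per_sample - true_logpx)  # all entries ~ 0
```

The pointwise identity log p(X,z) - log p(z|X) = log p(X) is what makes the bound exact, mirroring the EM argument above.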
pp. 9
> Assuming we use an arbitrarily high-capacity model for Q(z|x), then Q(z|x) will hopefully actually match P(z|X), in which case this KL- divergence term will be zero, and we will be directly optimizing log P(X). As an added bonus, we have made the intractable P(z|X) tractable: we can just use Q(z|x) to compute it.
Not necessarily; we may also simply overfit log P(X).
Section 2.2
pp. 10: brute-force sampling without the reparameterization trick won't give you a gradient.
Essentially, brute-force sampling gives an unbiased estimate of E[log P(X|z)], but that estimate carries no information about its gradient w.r.t. Q.
> However, in Equation 9, this dependency has disappeared!
> Stochastic gradient descent via backpropagation can handle stochastic inputs, but not stochastic units within the network! The solution, called the "reparameterization trick" in [1], is to move the sampling to an input layer.
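The trick can be demonstrated in a few lines of numpy (my own toy example, not from the paper): writing z = mu + sigma * eps moves the sampling to an "input" eps, so the pathwise derivative of f(z) w.r.t. mu is well defined and its Monte Carlo average matches the closed-form gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.7
eps = rng.standard_normal(200_000)

# Reparameterize: z = mu + sigma * eps, so all randomness enters as an input.
z = mu + sigma * eps

# Pathwise gradient of E[z**2] w.r.t. mu: d/dmu (mu + sigma*eps)**2 = 2*z.
grad_est = np.mean(2 * z)

# Closed form: E[z**2] = mu**2 + sigma**2, so the true gradient is 2*mu = 3.0.
print(grad_est)  # ~ 3.0
```

Without the reparameterization, sampling z directly from N(mu, sigma^2) gives samples that carry no usable derivative w.r.t. mu, which is exactly the point of the quote above.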
pp. 11: what if the hidden variable is discrete?
I think the DL book explains this better. The main problem is that changing Q's parameters infinitesimally doesn't change the cost, so we get zero gradient almost everywhere; at the remaining points the gradient is undefined (a jump). Therefore we can't optimize it.
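A tiny numeric sketch of the problem (my own illustration): "reparameterize" a Bernoulli latent as z = 1[u < theta] for fixed noise u. The per-sample loss is then a step function of theta, so a finite-difference (pathwise) gradient is zero almost everywhere, even though dE[z]/dtheta = 1.

```python
# A Bernoulli latent "reparameterized" as z = 1[u < theta], u ~ Uniform(0, 1).
# For one fixed draw of u, the loss as a function of theta is a step function:
# an infinitesimal change in theta leaves the sample unchanged.
u = 0.42                                # one fixed noise draw
loss = lambda theta: float(u < theta)   # pretend the downstream loss is z itself

theta, h = 0.7, 1e-6
fd_grad = (loss(theta + h) - loss(theta - h)) / (2 * h)
print(fd_grad)  # 0.0 -- no gradient signal, although dE[z]/dtheta = 1
```

This is why discrete latents need different estimators (e.g. score-function / REINFORCE-style gradients) rather than the pathwise trick.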
Section 2.4.1
Check the appendix for a proof that, with enough capacity, a VAE will find the true P(X). The proof assumes the best solution can be found, and merely shows that such a solution exists.
Section 2.4.2
This is a very good interpretation of the loss function.
I don't buy the argument that P(X|z) is inefficient because it has no covariance: if our Q is perfect, the sum of the two terms is already optimal (log P(X)), and we can't reduce it any further.
Section 2.4.3
It's about regularization of VAEs. Unlike a regular AE, you can't add regularization easily. While something resembling a regularizer appears when the output is Gaussian, it disappears for binary units, because that term is not regularization but the precision of the reconstruction.
Appendix A
> Note that $D[Q_\sigma(z|X)||P_\sigma(z|X)]$ is invariant to affine transformations of the sample space.
I think this fact is on Wikipedia.
Here, for Q and P, they transform the sample by
$z' = g(X) + (z - g(X)) \cdot \sigma$.
For $P_{\sigma}$, we need to compensate for the scaling by multiplying by $\sigma$; it's the basic change of variables in probability.},
keywords = {deep learning},
read = {Yes},
rating = {5},
date-added = {2017-04-23T23:38:17GMT},
date-modified = {2017-04-24T02:52:20GMT},
url = {http://arxiv.org/abs/1606.05908},
local-url = {file://localhost/Users/yimengzh/Documents/Papers3_revised/Library.papers3/Articles/2016/Doersch/arXiv%202016%20Doersch.pdf},
file = {{arXiv 2016 Doersch.pdf:/Users/yimengzh/Documents/Papers3_revised/Library.papers3/Articles/2016/Doersch/arXiv 2016 Doersch.pdf:application/pdf}},
uri = {\url{papers3://publication/uuid/0E93CF01-7F2B-4EA6-8413-E0A82F55B1A0}}
}