
L2 Loss and Gaussian Distribution

Minimizing L2 loss comes from the assumption that the data is drawn from Gaussian distribution

The sentence “Minimizing L2 loss comes from the assumption that the data is drawn from Gaussian distribution” is taken from the paper *Deep Multi-Scale Video Prediction Beyond Mean Square Error*. The following is my understanding, based on some useful resources.

Suppose we have a dataset $\mathcal{X}=\{\mathbf{x}^{(1)},\mathbf{x}^{(2)}, \cdots, \mathbf{x}^{(m)}\}$. The data comes from an unknown distribution $p_{data}(\mathbf{x})$, and we use $p_{model}(\mathbf{x}; \theta)$, parametrized by $\theta$, to model the unknown data distribution $p_{data}(\mathbf{x})$. We use maximum likelihood to estimate the parameters:
$$ \theta_{ML}=\arg\max\limits_{\theta}~p_{model}(\mathcal{X};\theta)=\arg\max\limits_{\theta}\prod_{i=1}^m p_{model}(\mathbf{x}^{(i)};\theta) $$
Because a product of many probabilities is inconvenient to work with and suffers from numerical underflow, we take the logarithm of the likelihood, which does not change the maximizer:
$$ \theta_{ML}=\arg\max\limits_{\theta}\sum_{i=1}^m \log p_{model}(\mathbf{x}^{(i)};\theta) $$
Now assume that $p_{model}(\mathbf{x};\theta)$ is a unit-variance Gaussian, $p_{model}(\mathbf{x};\theta)=\mathcal{N}(\mathbf{x};\mu, 1)$, and learn $\theta := \mu$ using maximum likelihood:
$$ \begin{align} \theta_{ML} &= \arg\max\limits_{\theta}\sum_{i=1}^m \log p_{model}(\mathbf{x}^{(i)};\theta)\\ &=\arg\max\limits_{\theta}\sum_{i=1}^m \log\left(\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{(\mathbf{x}^{(i)}-\theta)^2}{2}\right)\right)\\ &=\arg\max\limits_{\theta}\sum_{i=1}^m \left(\log\frac{1}{\sqrt{2\pi}}-\frac{(\mathbf{x}^{(i)}-\theta)^2}{2}\right) \\ &=\arg\max\limits_{\theta}\left(-m\log\sqrt{2\pi}-\frac{1}{2}\sum_{i=1}^m(\mathbf{x}^{(i)}-\theta)^2\right) \\ &=\arg\max\limits_{\theta}-\sum_{i=1}^m(\mathbf{x}^{(i)}-\theta)^2\\ &=\arg\min\limits_{\theta}\sum_{i=1}^m(\mathbf{x}^{(i)}-\theta)^2 \end{align} $$
We can see that $\arg\min\limits_{\theta}\sum_{i=1}^m(\mathbf{x}^{(i)}-\theta)^2$ is exactly the L2 loss once the model's prediction $\hat{\mathbf{x}}$ plays the role of $\theta$. So minimizing the L2 loss is equivalent to maximum-likelihood estimation under a Gaussian model, which is why the data is assumed to be drawn from a Gaussian distribution.
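
To make the equivalence concrete, here is a minimal numerical sketch (a synthetic 1-D dataset and NumPy; all names below are my own, not from the paper). It evaluates the Gaussian negative log-likelihood and the L2 loss over a grid of candidate means and checks that both are minimized at the sample mean, i.e. the maximum-likelihood estimate of $\theta$.

```python
import numpy as np

# Hypothetical 1-D dataset standing in for x^(1), ..., x^(m);
# the true mean is 3.0 and the noise is unit-variance Gaussian.
rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=1.0, size=1000)

def neg_log_likelihood(theta, x):
    """Negative log-likelihood of N(theta, 1) summed over the dataset x."""
    return np.sum(0.5 * np.log(2 * np.pi) + 0.5 * (x - theta) ** 2)

def l2_loss(theta, x):
    """Sum of squared errors between the data and the constant prediction theta."""
    return np.sum((x - theta) ** 2)

# Evaluate both objectives on a grid of candidate values of theta.
thetas = np.linspace(2.0, 4.0, 2001)
nll_values = np.array([neg_log_likelihood(t, x) for t in thetas])
l2_values = np.array([l2_loss(t, x) for t in thetas])

# Both objectives are minimized at (essentially) the same theta,
# which is the sample mean -- the Gaussian maximum-likelihood estimate.
print("argmin of NLL :", thetas[np.argmin(nll_values)])
print("argmin of L2  :", thetas[np.argmin(l2_values)])
print("sample mean   :", x.mean())
```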

Other materials

I wrote to Michael and he replied as follows:

Hi,

Thank you for your interest in our paper. This sentence can be confusing. Your intuition is correct, but here I was not talking about prediction but the underlying data distribution.

If you are talking about the prediction, it’s as you say, using l2 means you assume the prediction error is a zero mean Gaussian.

I was going one step further: if you assume that your prediction error is a zero mean Gaussian, your model is correct only if the data was generated by having Gaussian noise added to the mean of the next frames.

Of course, this is not the case in practice, the noise is far from Gaussian, and that is the reason we tried different losses.

I hope that is clear.

Best, Michael

And I also wrote to the author of Ref. 1 and he replied as follows:

Hi Chuanting,

We assume a model $p(x | \theta)$ to estimate the unknown data distribution $p(x)$. We can use maximum likelihood which corresponds to finding $\theta$ that makes our model fit our data the best. Now, if we assume $p(x | \theta) := N(x | \theta, \sigma)$ where $\theta$ is the mean parameter and $\sigma$ fixed, it is clear from the reasoning above that we are assuming $x$ is coming from our model which is a Gaussian. This is what I meant from what I wrote in the post you linked.

To answer your second question: Given the above model, to fit it to our dataset, we take $\log$ to get the log-likelihood function. Indeed, up to some constant, we will get the negative of L2-loss. So from above and this, it is clear that, if we assume our data is coming from a Gaussian, and we maximize its log likelihood, it is the same as minimizing the L2-loss.

Hope this helps.

What I meant exactly is that the data, not the error, that came from a Gaussian.

To give a sketch, assume our dataset is $D = \{x_n\}$, sampled iid. from an unknown distribution $p(x)$. Now, we would like to model this distribution with a model $q(x | \theta) := N(x | \mu, 1)$ and learn $\theta := \mu$ using max-likelihood. Notice that in this case, we assume that the data (and not the error [in fact we do not have any notion of “error” here]) are coming from a Gaussian $q(x | \theta)$.

Now, we do optimization: $\max\limits_{\theta} \log \prod\limits_n q(x_n | \theta)$. If we work out the algebra, we will have this optimization: $\max\limits_\mu -\sum\limits_n (x_n - \mu)^2$ equals to $ \min\limits_\mu \sum\limits_n (x_n - \mu)^2$. And indeed, this is just the L2 loss.

I hope from a simple example above, it is clear how the Gaussian assumption over our data (and not the error) would yield L2 loss.

Cheers, Agustinus
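
As a quick numerical check of the algebra in the sketch above (again a minimal example with NumPy and a made-up dataset of my own): the negative log-likelihood of $q(x | \mu) = N(x | \mu, 1)$ and the L2 loss $\sum\limits_n (x_n - \mu)^2$ differ only by a factor of $1/2$ and an additive constant, so they must share the same minimizer $\mu$.

```python
import numpy as np

# Hypothetical dataset D = {x_n}, drawn i.i.d. as in the sketch above.
rng = np.random.default_rng(1)
x = rng.normal(loc=-1.5, scale=1.0, size=500)

def neg_log_likelihood(mu, x):
    """-log prod_n q(x_n | mu) for q(x | mu) = N(x | mu, 1)."""
    return np.sum(0.5 * np.log(2 * np.pi) + 0.5 * (x - mu) ** 2)

def l2_loss(mu, x):
    """sum_n (x_n - mu)^2."""
    return np.sum((x - mu) ** 2)

# The two objectives differ only by a positive scale (1/2) and an additive
# constant (m/2 * log(2*pi)), so they share the same minimizer.
for mu in (-3.0, -1.5, 0.0, 2.0):
    gap = neg_log_likelihood(mu, x) - 0.5 * l2_loss(mu, x)
    print(f"mu = {mu:+.1f}  gap = {gap:.6f}")  # the gap is the same for every mu
print("expected constant:", 0.5 * len(x) * np.log(2 * np.pi))
```

Because the gap between the two objectives does not depend on $\mu$, any procedure that minimizes one of them also minimizes the other.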

I hope these materials help others understand the reason that “minimizing L2 loss comes from the assumption that the data is drawn from Gaussian distribution”.

Updates 2019-04-23

Finally, I found an excellent explanation on Quora; see more details through this link.

References

Ref. 1: Why does L2 reconstruction loss yield blurry images? Link

Ref. 2: Maximum likelihood estimators and least squares. Link

Ref. 3: MSE as Maximum Likelihood. Link