In The Data Science Design Manual, Skiena sums up some of the principles Nate Silver uses for doing data science. One of these principles is “think probabilistically”, with the explanation:

“The real world is an uncertain place, and successful models recognize this uncertainty. There are always a range of possible outcomes that can occur with slight perturbations of reality, and this should be captured in your model. Forecasts of numerical quantities should not be single numbers, but instead report probability distributions. Specifying a standard deviation σ along with the mean prediction µ suffices to describe such a distribution, particularly if it is assumed to be normal.” - The Data Science Design Manual p. 203.

As behooves good statisticians, we really ought to test such a statement. It so happens that I’ve recently also been watching Corridor Crew videos and their challenges to design and render a scene in a day or two.

Thus inspired, I decided to combine both ideas into: a paper in a day. The conditions of the challenge are simple: answer an AI/data science/machine learning research question in a day and produce a blog post at the end, which will serve as the paper. Of course, I cannot achieve a paper of journal grade in such a short time; that is not the point of the challenge. The goal is to ‘fail faster’, learn something and have fun. And, if you happen to be curious about the results: read on!

Introduction

As stated above, Skiena describes the principle that probabilistic models are better, or more successful, than non-probabilistic models. He presumably means that these models tend to describe the data better because they capture more information about the nature of the data. But it made me wonder:

Do neural networks benefit from estimating the standard deviation as well as the target function?

Let me go a little deeper into this. Neural networks are used for function estimation, and when combined with the MSE (mean squared error) loss they can be seen as approximating a mean function. This is because minimizing the mean squared error is essentially the same as finding the mean. However, because such a neural network outputs only a single number (per input ‘x’) and does not produce a probability density, the model should not really be thought of as probabilistic, except in the case of noise with a constant standard deviation.
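To see concretely why minimizing squared error amounts to estimating a mean: for fixed observations \(y_1, \dots, y_n\), the best constant prediction \(c\) under squared error satisfies

\[\frac{d}{dc}\sum_{i=1}^{n}(y_i-c)^2 = -2\sum_{i=1}^{n}(y_i-c) = 0 \quad\Longleftrightarrow\quad c = \frac{1}{n}\sum_{i=1}^{n}y_i,\]

which is exactly the sample mean; applying the same argument per input \(x\) yields the conditional mean.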

It is straightforward to extend the MSE loss to a loss function that can approximate a standard deviation function as well. This is done via the MLE (maximum likelihood estimator) for normally distributed noise with unknown mean and standard deviation. With \(f\) the mean estimate and \(g\) the standard deviation estimate, the associated loss function is:

\[l(f(x_i), g(x_i), y_i) = \frac{(y_i-f(x_i))^2}{g(x_i)^2} + \log(g(x_i)^2).\]
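As a concrete illustration, this loss is only a few lines in PyTorch. The snippet below is a minimal sketch with names of my own choosing rather than code lifted from the scripts; PyTorch also ships a built-in `GaussianNLLLoss` that computes essentially the same quantity.

```python
import torch

def gaussian_nll_loss(mean, std, target):
    """Loss from the formula above: squared error scaled by the predicted
    variance, plus a log-variance term that keeps the predicted variance
    from growing without bound."""
    var = std.pow(2).clamp_min(1e-6)  # keep the variance away from zero for stability
    return ((target - mean).pow(2) / var + torch.log(var)).mean()
```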

By letting two networks with these two different loss functions compete on a task we can test if simply training for the standard deviation as well as for the target function improves the performance of the network or if it actually performs worse.

Because we want this to be a straight-up test of the principle, we constrain both networks to use the same network topology. The only difference is that the final linear layer outputs two values instead of one in the network that also estimates the standard deviation.

This means that this network has to split its resources between the mean estimator and the standard deviation estimator, but it also means that any benefit derived from estimating the standard deviation may be exploited by the mean estimator. And there is some argument to be made that estimating the standard deviation exposes extra information. This comes from linear regression: if we approximate a linear function with normal noise using linear regression, then the residuals are independent of the predicted values. In this sense there is different information in the residuals than in the predicted function, and the standard deviation network may be able to exploit this extra information.

We set up our neural networks as fully connected ReLU networks with two different network shapes: one with width 10 and depth 3, and another with width 100 and depth 10. The smaller 10-by-3 network should have an easier time converging to the intended target, whereas the bigger network will have much more trouble converging. This simply comes down to the number of parameters that all need to be estimated and need to cooperate to get close to the target.
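For reference, such a pair of networks can be built in PyTorch roughly as follows (a sketch; exactly how depth is counted in the actual scripts is a detail I gloss over here):

```python
import torch.nn as nn

def make_network(width, depth, n_outputs):
    """Fully connected ReLU network: `depth` hidden layers of size `width`
    followed by a final linear layer with 1 (mean) or 2 (mean and std) outputs."""
    layers = [nn.Linear(1, width), nn.ReLU()]
    for _ in range(depth - 1):
        layers += [nn.Linear(width, width), nn.ReLU()]
    layers.append(nn.Linear(width, n_outputs))
    return nn.Sequential(*layers)

mse_net = make_network(width=10, depth=3, n_outputs=1)  # plain mean estimator
nll_net = make_network(width=10, depth=3, n_outputs=2)  # mean and standard deviation
```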

For the learner we use the Adam gradient descent method with the AMSGrad update and its default parameters (sketched in the snippet after this list). These parameters are:

  • A learning rate of .001
  • Betas at 0.9 and 0.999
  • Epsilon at 1e-8.
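In PyTorch this corresponds to something like the following, where only the amsgrad flag deviates from the defaults:

```python
import torch
import torch.nn as nn

net = nn.Linear(1, 2)  # placeholder; in the experiments this is one of the ReLU networks above

optimizer = torch.optim.Adam(
    net.parameters(),
    lr=1e-3,              # learning rate of .001
    betas=(0.9, 0.999),
    eps=1e-8,
    amsgrad=True,         # use the AMSGrad variant of the update
)
```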

We test the principle in four test batteries in total. In each test the two different networks go head-to-head in estimating the target function. They both receive the same 100,000 data points in the same order to ensure fair competition.
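A minimal sketch of this head-to-head loop, assuming one gradient step per data point (the actual batching in the scripts may differ) and reusing `make_network` and `gaussian_nll_loss` from the snippets above:

```python
import torch

def train_pair(mse_net, nll_net, xs, ys):
    """Feed the exact same points, in the same order, to both networks."""
    opt_mse = torch.optim.Adam(mse_net.parameters(), amsgrad=True)
    opt_nll = torch.optim.Adam(nll_net.parameters(), amsgrad=True)
    mse = torch.nn.MSELoss()
    for x, y in zip(xs, ys):
        x, y = x.view(1, 1), y.view(1, 1)

        opt_mse.zero_grad()
        mse(mse_net(x), y).backward()
        opt_mse.step()

        opt_nll.zero_grad()
        out = nll_net(x)  # column 0: mean estimate, column 1: std estimate
        gaussian_nll_loss(out[:, :1], out[:, 1:], y).backward()
        opt_nll.step()
```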

In the first battery we test how well the networks of width ten and depth three perform on simple function estimation with noise. The target functions will be:

  • A constant function: \(x \mapsto 1\).
  • The identity function: \(x \mapsto x\).
  • The square function: \(x \mapsto x^2\).
  • The sigmoid function: \(x \mapsto \frac{e^x}{e^{-x}+e^x}\).
  • The cosine function: \(x \mapsto \cos (x)\).
  • The step function: \(x \mapsto 1_{x>0}\).

We will also vary the standard deviation of the noise along the x-axis according to the following functions:

  • A constant function: \(x \mapsto 1\).
  • The square function: \(x \mapsto x^2+\frac{1}{10}\).
  • The inverted square function: \(x \mapsto 1-x^2+\frac{1}{10}\).
  • The step function: \(x \mapsto 1_{x>0}+\frac{1}{10}\).

A small constant has been added to the standard deviation functions to prevent them from becoming zero, because the loss term \(\log(g(x_i)^2)\) becomes unstable near zero, and combined with a momentum-based learner this instability would propagate through the network for a while. Put simply, it leads to bad and unrealistic results, and a setup where the standard deviation is truly zero is not in the spirit of the test anyway. Also, if this proves to be a beneficial type of network, I have no doubt that proper corrections will readily be found.
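For concreteness, one battery’s data could be generated roughly as follows; the sampling interval for x is an assumption on my part, while the functions and the \(\frac{1}{10}\) offset are the ones listed above.

```python
import numpy as np

# Target functions from the first list.
targets = {
    "constant": lambda x: np.ones_like(x),
    "identity": lambda x: x,
    "square":   lambda x: x ** 2,
    "sigmoid":  lambda x: np.exp(x) / (np.exp(-x) + np.exp(x)),
    "cosine":   np.cos,
    "step":     lambda x: (x > 0).astype(float),
}

# Standard deviation functions from the second list, offset by 1/10.
noise_stds = {
    "constant":        lambda x: np.ones_like(x),
    "square":          lambda x: x ** 2 + 0.1,
    "inverted square": lambda x: 1 - x ** 2 + 0.1,
    "step":            lambda x: (x > 0).astype(float) + 0.1,
}

def make_dataset(target, noise_std, n=100_000, seed=0):
    """Sample x (here uniformly on [-1, 1]) and add Gaussian noise whose
    standard deviation varies along the x-axis."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1.0, 1.0, size=n)
    y = target(x) + rng.normal(0.0, noise_std(x))
    return x, y

x, y = make_dataset(targets["cosine"], noise_stds["inverted square"])
```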

The second battery follows the same setup as the first battery but with the larger network size of width one hundred and depth ten.

The third battery again uses the smaller networks, but these will now be learning instances of Brownian motion without noise. Brownian motion is a nowhere-differentiable random function. The networks will both be trying to estimate it as is. It is impossible for them to learn it perfectly, so they will both have to adapt as if there were noise. This will serve as a way to understand how the networks respond to learning a complicated task.
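Sampling such a path comes down to accumulating independent Gaussian increments; a quick sketch (the grid size and time interval are arbitrary choices here):

```python
import numpy as np

def brownian_path(n_steps=1000, t_max=1.0, seed=0):
    """Brownian motion on [0, t_max]: B(0) = 0 and the increments over steps
    of size dt are independent N(0, dt) variables, accumulated by a cumulative sum."""
    rng = np.random.default_rng(seed)
    dt = t_max / n_steps
    increments = rng.normal(0.0, np.sqrt(dt), size=n_steps)
    t = np.linspace(0.0, t_max, n_steps + 1)
    b = np.concatenate([[0.0], np.cumsum(increments)])
    return t, b

t, b = brownian_path()  # the (t, B(t)) pairs the networks try to fit, without added noise
```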

The fourth battery is the same as the third battery but with the larger networks.

Before we head into the results, I want to mention that you can run this code yourself using the scripts on my GitHub page.

Results

We can objectively compare the performance of the networks by checking which gets closer to the actual target. I’ve included plotted results in the gallery at the bottom of the page. In the first test battery, doing function approximation with noise and small networks, the network that also estimates the standard deviation outperforms the standard network 14 out of 24 times. A binomial test provides no evidence that the standard-deviation-estimating network outperformed the standard network, with a p-value of 0.15. This was its best showing, as it scored worse in the other test batteries. In the second test battery it outperformed in only 10 out of 24 tests, and in the third and fourth batteries it outperformed in only 2 out of 10 tests each. All of the associated p-values are above 0.5.

Discussion

Of course, closeness is not the only measure of success. Success has many more aspects, so let’s look at the images below.

In each image you see four graphs. The top left graph is a plot of the natural logarithm of the true distance between the target function and the network. So this is not based on the loss received by the learners, which is muddied by noise, but rather on the distance to the target function.

In the top right graph you see a plot of the target function in black, with one-sigma bounds from the standard deviation function as grey dashed lines, along with the last twenty data points passed to the networks.

In the two bottom graphs you see the same plot as the top right one, but with the network’s guess added in red. Moreover, in the bottom right graph we also see the network’s guess for the standard deviation, added as one-sigma bounds around the network’s guess.

I think these images show some really cool things. The first thing that stands out to me is that there is a lot of jankiness in the larger networks, whereas the smaller networks show a lot of kinks. What you are seeing is similar to overfitting and underfitting.

When the network is large it has a huge space of potential functions it must search through to get close to the target function, but because there is noise in the data set it will misinterpret many data points, which leads to these weird curves missing the target. From the perspective of the network there isn’t much reason to believe that the curve should be simpler, because it doesn’t understand simple curves the way we do. There are some things you could do to help it converge, among them lasso or ridge regression.
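In PyTorch, ridge-style (L2) regularization is typically applied through the optimizer’s `weight_decay` parameter, while lasso-style (L1) regularization means adding a penalty term to the loss by hand. A quick sketch of both:

```python
import torch
import torch.nn as nn

net = nn.Linear(1, 1)  # placeholder for one of the networks above

# Ridge / L2: Adam applies weight decay to all parameters at every step.
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3, amsgrad=True, weight_decay=1e-4)

# Lasso / L1: add this penalty to the loss before calling backward().
def l1_penalty(model, strength=1e-4):
    return strength * sum(p.abs().sum() for p in model.parameters())
```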

The other side of the coin is that smaller networks have only a very limited way to express themselves because of their small number of parameters. Thus, even if such a network is propagating the received losses properly and optimizing properly, it can still only get so close before it becomes a matter of balancing all the errors instead of learning the shape of the function. This can easily be seen in the tests with the inverted square noise. Here the standard deviation functions are estimated with a small number of linear pieces. It would require many linear pieces to really capture the curve, but there is not that much space left in the network to store this information.

Speaking of standard deviation functions: we see that in general the network learns the correct deviation function, albeit in crude shapes and curves. So it is really cool to see that plugging in the loss derived from the MLE just works. And in the cases where the network doesn’t estimate the mean correctly, it still seems to capture roughly 68% of the data points within its standard deviation bounds. I hadn’t set up the code to test this, but it would make an interesting case study to see whether it is actually doing this.

Looking at the specific cases where the network was far off the intended target, we see that it must have been making the same error in the same direction over and over again. This highlights, for me, a problem which we get in function estimation but never in parametric estimation. These networks were continuously making the same mistake near the same points. Looking at the loss function, we can immediately see that the network doesn’t have a lot of spatial awareness along the x-axis. All the loss function receives are outputs; it never considers how the input affects them except through the derivative. Thus it can only get lucky in converging for given inputs instead of being aware that there is actually a more complicated shape to be learned.

It may benefit the network if it locally keeps track of how often it made an error in a particular direction and scales the loss accordingly. For example, we could use a simple step estimator to learn the residual errors the network is making and punish the network if it keeps making the same directional error. This needs some proper scaling and fiddling to get right, but it may be worth investigating, for instance along the lines of the sketch below.
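As a very rough sketch of that idea (everything below is hypothetical and not part of the experiments): keep a binned running average of the sign of the residuals and scale the loss up in regions where the network keeps erring in the same direction.

```python
import numpy as np

class DirectionalErrorTracker:
    """Hypothetical helper: tracks a running mean of the residual sign per x-bin
    and returns a loss multiplier that grows when the sign keeps repeating."""

    def __init__(self, n_bins=20, x_min=-1.0, x_max=1.0, momentum=0.99):
        self.edges = np.linspace(x_min, x_max, n_bins + 1)
        self.signed_error = np.zeros(n_bins)
        self.momentum = momentum

    def update(self, x, residual):
        i = int(np.clip(np.searchsorted(self.edges, x) - 1, 0, len(self.signed_error) - 1))
        self.signed_error[i] = (self.momentum * self.signed_error[i]
                                + (1 - self.momentum) * np.sign(residual))
        # Multiplier in [1, 2]: close to 2 when the residual sign is persistent.
        return 1.0 + abs(self.signed_error[i])
```

The returned multiplier could then scale the per-point loss before the backward pass; as said, the momentum and scaling would need proper tuning.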

Finally, considering the Brownian motions, we see some kinks in the standard deviation estimate that the mean function should be able to utilize but isn’t. This could be a case where the standard deviation is learning in the direction of a positive value but the mean function is only capable of utilizing the direction of the negative value. If we gave the mean function one more layer to play with, it might become easier to cascade these peaks found in the standard deviation into better approximations of peaks in the target function. This might even help with some of the convergence cases of the simple functions.

Conclusion

All in all, I would say that I’d trust the standard neural network with MSE more than the adapted neural network, but with a bit more experimentation I think networks that can also estimate the standard deviation can be a great tool in your data science tool belt. There are some new ideas that might benefit the training of these networks, and I’m raring to go explore them.

As for the challenge, I may have missed my deadline by four days. The actual code was done and run within a day, but writing this took much longer than I expected. I’m definitely enthusiastic to do these more often, though. I hope you found this as interesting as I did and that you’ll be raring to see what’s next. If not, have a look around at some of my other content; perhaps you can find something more to your liking. See you next time!

Results for function approximation with noise by fully connected ReLU networks with width 10 and depth 3.
Results for Brownian motion approximation without noise by fully connected ReLU networks with width 10 and depth 3.
Results for function approximation with noise by fully connected ReLU networks with width 100 and depth 10.
Results for Brownian motion approximation without noise by fully connected ReLU networks with width 100 and depth 10.
