Deep Evidential Regression a.k.a. can we trust the model?
Imagine you are the chief ML engineer at a company building autonomous vehicles. Right now, you are in charge of building a robust machine learning model that has to predict the steering wheel angle based on the scenery. Ideally, you want your model to be as correct and as confident as possible. You will not be happy if the model predicts you should turn 10 degrees to the right, but with a standard deviation of 20 degrees in both directions. That is as good as playing darts blindfolded. On the other hand, if you feed in several scenes with only slightly different backgrounds and the outputs vary wildly, you will also not be happy, because you feel you cannot trust the model, even if each individual standard deviation is low. All in all, we want predictions that are both accurate and consistent. But because real-world applications are not fairytales, the noise inherent to the data and to the model persists. To tackle and visualize this issue, Alexander Amini et al. published a paper titled "Deep Evidential Regression" in 2019, introducing a novel way to approach regression modeling by learning the data-intrinsic and model-intrinsic uncertainties as two separate quantities. This, they claim, gives the user of the model much better insight in applications where control is critical, such as steering a self-driving car or predicting risk scores in medicine.
What is the main idea?
Imagine you have a deep neural network that predicts the steering wheel angle. Based on the data, you train the network to predict the turn angle. After training is complete, you wish to evaluate the model on 100 test samples to see how it performs. If you plot the difference between the predictions and the ground truths, you will probably get a Gaussian curve centered at μ with a variance of σ², which is commonly denoted as N(μ, σ²):
So far so good, but what would happen if we trained, say, 10 different neural networks with different starting parameters? By doing that, we would get 10 different means and 10 different variances. If we treat the means and variances as data points and plot them in two dimensions, we get something like this:
How do we interpret this? Well, because the means are centered at 0, the epistemic (model) uncertainty is low, which means that the training process was successful and is not biased, except for one or maybe two outliers. The Y-axis, however, represents the variance of the predictions, which is definitely not negligible: the aleatoric (data-intrinsic) uncertainty is high, which means that there is some inherent noise in the training data.
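To make this concrete, here is a minimal, purely illustrative sketch of how such a mean-variance scatter plot could be produced. The ensemble here is simulated with synthetic numbers (the model count, residual spread, and noise levels are made-up assumptions, not values from the paper):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

n_models = 10   # size of our hypothetical ensemble
n_test = 100    # number of test samples per model

means, variances = [], []
for _ in range(n_models):
    # Stand-in for the "prediction - ground truth" residuals of one trained model:
    # centered near 0 (low bias) but with a noticeable spread (noisy data).
    residuals = rng.normal(loc=rng.normal(0.0, 0.5),
                           scale=rng.uniform(2.0, 4.0),
                           size=n_test)
    means.append(residuals.mean())
    variances.append(residuals.var())

plt.scatter(means, variances)
plt.xlabel("mean of residuals (epistemic axis)")
plt.ylabel("variance of residuals (aleatoric axis)")
plt.title("One point per trained model")
plt.show()
```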
This is where the magic happens: instead of treating each (μ, σ²) pair as an isolated estimate, we place a higher-order distribution over the parameters of the output Gaussian themselves.
By doing that, we obtain a new, evidential distribution of data points, each of which is itself an output distribution.
So how does the math work?
The first thing to note when approaching this problem is that it is, at its core, Bayesian. We start by placing prior distributions with initial parameters over the outputs, optimize them to fit the training data, and iterate. Concretely, we use a Gaussian prior to estimate the unknown mean and an Inverse-Gamma prior to estimate the unknown variance.
To calculate the true posterior distribution, we assume that the estimated distribution can be factorized, and the resulting Normal-Inverse-Gamma (NIG) distribution can be expressed using four hyperparameters (γ, υ, α, β), as explained in the figure above.
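The hierarchy is easier to see in code. Below is a minimal sketch (my own illustration, not the authors' implementation) that draws a (μ, σ²) pair from a NIG distribution with hand-picked hyperparameters and then draws an observation from the resulting Gaussian:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hand-picked NIG hyperparameters (gamma, upsilon, alpha, beta) - illustrative only.
gamma, upsilon, alpha, beta = 0.0, 2.0, 3.0, 1.5

# 1) Draw the variance from an Inverse-Gamma(alpha, beta) distribution.
sigma2 = stats.invgamma.rvs(a=alpha, scale=beta, random_state=rng)

# 2) Draw the mean from a Gaussian whose spread shrinks as the evidence upsilon grows.
mu = rng.normal(loc=gamma, scale=np.sqrt(sigma2 / upsilon))

# 3) Draw an actual observation from the sampled likelihood N(mu, sigma2).
y = rng.normal(loc=mu, scale=np.sqrt(sigma2))

print(f"sampled sigma^2 = {sigma2:.3f}, mu = {mu:.3f}, observation y = {y:.3f}")
```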
Authors' note
A popular interpretation of the parameters of this conjugate prior distribution is in terms of "virtual observations" in support of a given property [23]. For example, the mean of a NIG distribution can be intuitively interpreted as being estimated from υ virtual observations with sample mean γ, while its variance is estimated from 2α virtual observations with sample mean γ and sum of squared deviations 2β. Following this interpretation, we define the total evidence, Φ, of our evidential distributions as the sum of all inferred virtual-observation counts: Φ = 2υ + α. Drawing a sample θ_j from the NIG distribution yields a single instance of our likelihood function, namely N(μ_j, σ²_j). Thus, the NIG hyperparameters, (γ, υ, α, β), determine not only the location but also the dispersion concentrations, or uncertainty, associated with our inferred likelihood function. Therefore, we can interpret the NIG distribution as the higher-order, evidential distribution on top of the unknown lower-order likelihood distribution from which observations are drawn.
Given a NIG distribution, we can compute the prediction, the aleatoric uncertainty, and the epistemic uncertainty as:
- prediction: E[μ] = γ
- aleatoric uncertainty: E[σ²] = β / (α − 1)
- epistemic uncertainty: Var[μ] = β / (υ(α − 1))
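These three quantities are cheap to compute once a network has produced the four NIG hyperparameters. The sketch below is my own illustration with made-up hyperparameter values, not the authors' code:

```python
def nig_summaries(gamma, upsilon, alpha, beta):
    """Prediction, aleatoric and epistemic uncertainty of NIG(gamma, upsilon, alpha, beta).

    Formulas follow the Deep Evidential Regression paper; alpha must be > 1
    for the expectations below to be finite.
    """
    prediction = gamma                            # E[mu]
    aleatoric = beta / (alpha - 1.0)              # E[sigma^2]
    epistemic = beta / (upsilon * (alpha - 1.0))  # Var[mu]
    return prediction, aleatoric, epistemic

# Made-up hyperparameters for illustration.
print(nig_summaries(gamma=0.1, upsilon=2.0, alpha=3.0, beta=1.5))
# -> (0.1, 0.75, 0.375)
```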
In short, all of this fancy math above is necessary so that the transition from individual probability distributions to one master evidential distribution stays tractable in both the forward and the backward pass. This is very important because it is a prerequisite for calculating gradients during backpropagation.
What about the learning process?
There are two objectives by which the network weights that parameterize the evidential prior are optimized.
(1) Maximizing the model fit
The evidential prior is a three-dimensional landscape (or a two-parameter probability density function) over μ and σ², with the likelihood of a particular (μ, σ²) pair being sampled from the evidential prior represented by the opacity of blue in the figure above. When we encounter a new observation y, we have to calculate the probability of this observation under each of the lower-level likelihood distributions inside the evidential prior. In general, this requires a double integration (marginalizing over both μ and σ²) across the evidential prior parameter space, which is intractable. Luckily for us, the NIG distribution admits an analytical solution, the derivation of which is very well explained in the original paper. The resulting loss function, the negative log-likelihood (NLL) loss, reduces to the following equation:

L_i^NLL = ½·log(π/υ) − α·log(Ω) + (α + ½)·log((y_i − γ)²·υ + Ω) + log(Γ(α) / Γ(α + ½)), where Ω = 2β(1 + υ)
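A minimal NumPy sketch of this NLL, written here for illustration (the authors provide their own TensorFlow implementation in their repository), could look like this:

```python
import numpy as np
from scipy.special import gammaln  # log of the Gamma function, for numerical stability

def nig_nll(y, gamma, upsilon, alpha, beta):
    """Negative log-likelihood of y under the analytical marginal of a NIG prior.

    Follows the analytical NLL from the Deep Evidential Regression paper;
    all arguments can be NumPy arrays of the same shape.
    """
    omega = 2.0 * beta * (1.0 + upsilon)
    return (
        0.5 * np.log(np.pi / upsilon)
        - alpha * np.log(omega)
        + (alpha + 0.5) * np.log(upsilon * (y - gamma) ** 2 + omega)
        + gammaln(alpha)
        - gammaln(alpha + 0.5)
    )

# Toy check with made-up hyperparameters: an observation far from gamma is more "surprising".
print(nig_nll(np.array([0.0, 5.0]), gamma=0.0, upsilon=2.0, alpha=3.0, beta=1.5))
```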
(2) Minimizing evidence on errors
The second objective aims to minimize the evidence (the total evidence Φ = 2υ + α introduced above, i.e., the number of virtual observations backing a prediction) wherever the difference between the prediction and the observation is high:

L_i^R = |y_i − γ| · (2υ + α)
To finish, the total loss is just a composite of the NLL loss and the regularization loss, scaled by a regularization parameter, lambda:

L_i = L_i^NLL + λ · L_i^R
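Putting the two objectives together, again as a hedged NumPy sketch rather than the authors' implementation (the λ value below is an arbitrary choice):

```python
import numpy as np
from scipy.special import gammaln

def nig_nll(y, gamma, upsilon, alpha, beta):
    # Analytical NLL of the NIG marginal (same formula as the sketch above).
    omega = 2.0 * beta * (1.0 + upsilon)
    return (0.5 * np.log(np.pi / upsilon) - alpha * np.log(omega)
            + (alpha + 0.5) * np.log(upsilon * (y - gamma) ** 2 + omega)
            + gammaln(alpha) - gammaln(alpha + 0.5))

def nig_regularizer(y, gamma, upsilon, alpha):
    # Penalize large evidence (2*upsilon + alpha) on samples with a large error |y - gamma|.
    return np.abs(y - gamma) * (2.0 * upsilon + alpha)

def evidential_loss(y, gamma, upsilon, alpha, beta, lam=0.01):
    # Total per-sample loss: NLL plus the lambda-scaled evidence regularizer.
    return (nig_nll(y, gamma, upsilon, alpha, beta)
            + lam * nig_regularizer(y, gamma, upsilon, alpha))

# Toy example with made-up hyperparameters.
print(evidential_loss(np.array([0.0, 5.0]), gamma=0.0, upsilon=2.0, alpha=3.0, beta=1.5))
```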
So, what is new about all of this?
In the original paper, the authors highlight four main contributions:
- A novel and scalable method for learning epistemic and aleatoric uncertainty on regression problems, without sampling during inference or training with out-of-distribution data;
- Formulation of an evidential regularizer for continuous regression problems, necessary for penalizing incorrect evidence on errors and OOD examples;
- Evaluation of epistemic uncertainty on benchmark and complex vision regression tasks along with comparisons to state-of-the-art NN uncertainty estimation techniques; and
- Robustness and calibration evaluation on OOD and adversarially perturbed test input data.
Is there a way to quickly use this in code?
YES!
Luckily for us, the authors have created a nicely formatted GitHub repository, which contains the NIG layer and its associated loss that push the model to learn an evidential prior. Below is a quick code snippet that shows the whole process in action using Python.
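The snippet below is a minimal sketch of how training with the authors' evidential-deep-learning package typically looks, based on the usage shown in their repository; the layer and loss names (DenseNormalGamma, EvidentialRegression) and the coefficient value are taken from that README and may differ in newer versions. The toy dataset here is synthetic.

```python
import numpy as np
import tensorflow as tf
import evidential_deep_learning as edl  # pip install evidential-deep-learning

# Synthetic 1-D regression data, purely for illustration.
x_train = np.linspace(-4.0, 4.0, 1000, dtype=np.float32).reshape(-1, 1)
y_train = np.sin(x_train) + 0.1 * np.random.randn(*x_train.shape).astype(np.float32)

# A small network whose last layer outputs the four NIG hyperparameters (gamma, upsilon, alpha, beta).
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(64, activation="relu"),
    edl.layers.DenseNormalGamma(1),
])

# Evidential loss = NLL + coeff * evidence regularizer (coeff plays the role of lambda).
def evidential_regression_loss(y_true, y_pred):
    return edl.losses.EvidentialRegression(y_true, y_pred, coeff=1e-2)

model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss=evidential_regression_loss)
model.fit(x_train, y_train, epochs=50, batch_size=64)

# The trained model now returns (gamma, upsilon, alpha, beta) for each input,
# from which the prediction, aleatoric and epistemic uncertainty can be read off.
```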
References
I hope you found this article insightful. Be sure to check out the original paper and the presentation that the first author, Alexander Amini, gave at MIT.