Out-of-distribution Object Detection through Bayesian Uncertainty Estimation

The superior performance of object detectors is often established under the condition that the test samples are in the same distribution as the training data. However, in many practical applications, out-of-distribution (OOD) instances are inevitable and usually lead to uncertainty in the results. In this paper, we propose a novel, intuitive, and scalable probabilistic object detection method for OOD detection. Unlike other uncertainty-modeling methods that either require huge computational costs to infer the weight distributions or rely on model training through synthetic outlier data, our method is able to distinguish between in-distribution (ID) data and OOD data via weight parameter sampling from proposed Gaussian distributions based on pre-trained networks. We demonstrate that our Bayesian object detector can achieve satisfactory OOD identification performance by reducing the FPR95 score by up to 8.19% and increasing the AUROC score by up to 13.94% when trained on BDD100k and VOC datasets as the ID datasets and evaluated on COCO2017 dataset as the OOD dataset.


I. INTRODUCTION
Out-of-distribution (OOD) detection is a machine learning technique that aims to detect test samples drawn from a distribution that differs from the distribution of the training data, with the definition of distribution to be well-defined according to the application in the target.The goal of OOD detection is to differentiate between inputs that are likely to be part of the distribution the model was trained on, and inputs that are not [4].The vast majority of modern deep neural networks [5] are deterministic models that provide highconfidence predictions for inputs that are not seen during training, leading to poor generalization and unreliable results, as shown for example in Figure 1.OOD detection is particularly important for safety-related scenarios.For instance, an ideal uncertainty-aware deep object detector built for selfdriving cars should be able to recognize the in-distribution (ID) target, such as people and cars, and produce a low-confidence prediction for the OOD objects, such as an animal.
Exploring the uncertainties of the deep learning algorithms can help perform OOD detection [4].In general, there are two types of uncertainties: aleatoric uncertainty, due to the inherent and irreducible uncertainty of the input data, and epistemic uncertainty due to various deep learning model provided by a standard Faster-RCNN [2] model trained on the same dataset.(b) The same Faster-RCNN model produces overconfident predictions on MS-COCO dataset [3], which is an OOD dataset in this case.(c) The "ideal" predictions provided by the uncertaintyaware object detection model can recognize OOD data.
frameworks [6].A Bayesian Neural Networks (BNNs) is a type of neural network that incorporates Bayesian methods, such as probability distributions over model parameters, to represent these two types of uncertainties in the model's predictions [7].BNNs can be implemented for OOD detection by comparing the uncertainties between the model's predictions on a given input and the model's predictions on a set of known ID data.If the uncertainty in the input is high, the input is likely OOD.Also, Bayesian methods can be used to estimate the likelihood of the data under the model.If this likelihood is low, the input is more likely to be OOD data.However, due to the high dimensionality and multi-modality of modern deep neural networks with millions of weight parameters, Bayesian approaches have largely been intractable for state-of-the-art deep learning object detectors [8].State-of-the-art deterministic CNNs have achieved huge success on object detection tasks [9]- [11].A new inquiry is whether it is possible to create a Bayesian object detector that can maintain its outstanding performance on the ID dataset and have the ability to model uncertainty when it encounters the OOD dataset concurrently.
In order to address this problem, we propose a novel and scalable uncertainty-aware Bayesian deep learning method for OOD object detection.We exploit the information contained in the deterministic deep neural network layers when the model is trained in the ID dataset to efficiently approximate the posterior distribution over the weights.Our Bayesian approach focuses on performing Bayesian inference by sampling the predictions from the proposed approximation Gaussian distributions of neural weights.Our Bayesian inference method transforms the deterministic deep neural network layers into probabilistic Bayesian layers with no additional training cost.It makes it possible to maximize the uncertainty quantification abilities of BNNs on large computer vision tasks via BNNs.
In particular, our main contributions are: • In this work, we propose a scalable and flexible approximate Bayesian inference technique for OOD detection for deep learning object detectors.Our method offers the ability of uncertainty estimation to distinguish the OOD data from the ID data while preserving their outstanding performance on the ID task for large deep learning models.Our framework provides a flexible uncertainty estimation approach by choosing different layers transformed into Bayesian layers during the model inference stage.Different levels of uncertainties can be chosen to be reserved and estimated.• Our proposed Bayesian inference technique has been demonstrated effective on OOD detection tasks by comprehensively evaluated on typical OOD detection benchmarks.We show in Section IV that compared to other Bayesian methods, our method can reduce the FPR95 score by up to 8.19% and increase the AUROC score by up to 13.94% on BDD and PASCAL VOC as ID datasets and COCO2017 as the OOD dataset.The rest of the paper is organized in the following way.Section II provides the background of the OOD detection and Bayesian deep learning methods.In Section III, the proposed Bayesian inference framework is introduced in detail.IV shows the comprehensive experimental results and analysis on OOD benchmarks on both object detection and image classification tasks.The conclusions are presented in Section V.

A. Out-of-distribution Detection
The goal of out-of-distribution detection, or OOD detection, is to identify input data coming out of a different distribution from the training distribution.OOD detection is an important technique used to help neural networks determine their capability boundary [12].More specifically, given a labeled training dataset of N data pairs as D = {x n , y n } N n=1 , where x n is an input data sample in the domain X , and y n its corresponding target value in the domain Y.The goal of OOD detection is to build a detector f that f (x 1 , ..., x n ) = 1, ∀ i , p(x i ) ≥ δ and f (x 1 , ..., x n ) = 0, ∀ i , p(x i ) ≤ δ where δ is the capability boundary.
OOD detection for classification can be broadly categorized into two main approaches: generative-based methods [13]- [16] and reconstruction-based methods [17]- [20].A conceptually appealing approach to OOD detection involves fitting a generative model to a data distribution p(x; θ) and evaluating the likelihood of unseen samples under this model.
The assumption is that OOD samples will be assigned a lower likelihood than in-distribution samples and can be identified using a simple threshold on this value [21].Some of the stateof-the-art algorithms are: ensembling [22], ODIN [23], energy score [24], Mahalanobis distance [25], Gram matrices based score [26], and GradNorm score [27].
OOD detection for object detection is currently underexplored [28]- [30].Energy-based [24] has been proved with both mathematical insights and empirical evidence that the energy score is superior to both a softmax-based score and generativebased methods for OOD detection.Based on directly predicting the conditional target density of an energy-based model, a novel framework for OOD detection by adaptively synthesizing virtual outliers that can meaningfully regularize the model's decision boundary during training known as VOS [28] has been proved effective on OOD detection by modeling the aleatoric uncertainty.As a comparison, our method combines the uncertainty estimation for the classification and localization regression with a novel OOD scoring rule.

B. Bayesian Methods for Probabilistic Object Detection
Different from deterministic neural networks, BNNs aim to provide a natural interpretation of uncertainty estimation in deep learning, by inferring distributions over a network's weights θ ∈ W based on the training dataset D.
BNNs produce the predictive distribution p(y|x, D) by integrating over all values of network weights: where p(y|x, θ) represents the predicted distribution, and p(θ|D) represents the weight posterior distribution over the dataset [31].
BNNs can be used for OOD detection by comparing the uncertainty of the model's predictions on a given input to the uncertainty of the model's predictions on a set of known in-distribution data.In order to calculate the approximate solutions of the posterior distribution, various techniques have been proposed [32], [32], [33], [33]- [41].However, due to the high dimensionality and multi-modality of modern deep neural networks with millions of weight parameters, the posterior distribution is intractable for most of the state-of-the-art object detectors.Markov Chain Monte Carlo (MCMC) methods are wellrecognised methods for inference, including those combined with neural networks, for instance, the Hamiltonian Monte Carlo (HMC) [32].However, the biggest disadvantage is that requiring full gradients makes HMC computationally intractable for modern neural networks.Later the HMC framework has been extended into stochastic gradient HMC (SGHMC) [33], which allows stochastic gradients to be used in Bayesian inference.As an alternative, stochastic gradient Langevin dynamics (SGLD) [37] employs first-order Langevin dynamics in the stochastic gradient process.
Monte-Carlo Dropout aims to reduce the computational cost for approximating the true posterior distributions [42].It is a Bayesian inference method implemented based on the regularization techniques with Stochastic Gradient Descent (SGD) during the training stage, which links dropout-based neural network training to Variational Inference (VI) in BNNs.This method is later extended [43] by proving that the optimal dropout rate can be found by treating the dropout probability as a hyperparameter.During test time, samples from the approximate posterior distribution are generated by performing inference multiple times with dropout enabled: p(y|x, D) ≈ 1 T T t=1 p(y|x, θ t ).However, recent researchers are questioning whether MC Dropout is a Bayesian method [44].The implementation of MC Dropout fundamentally depends on the dropout layers being used as the regularization method during the training stage.Since 2015, Batch Normalization (BN) [45] has been applied to normalize the activations of a layer in a neural network by adjusting and scaling the activations to reduce internal covariate shifts.Due to the problem of variance shift problem when using BN and Dropout at the same time [46], BN has been proven as a more effective normalization method, which leads to Dropout being barely used in modern deep neural networks.This makes it difficult to use MC Dropout as a Bayesian inference method in advanced deep-learning models.
Direct Modeling Methods refer to those methods which assume certain probability distributions over the weight parameters in the neural networks, and then directly predict parameters for such distributions through output layers [31].Some methods aim at modifying the network's output layers and the loss function to model the uncertainty through a single forward pass.A softmax function with a Gaussian distribution is used to replace the standard softmax function with the cross-entropy loss [6] for classification uncertainty quantification.For object detection tasks, some methods focused on modeling the uncertainty coming out of the bounding box regression [47], [48].Furthermore, several works propose to estimate higher-order conjugate priors in addition to directly predicting the output probability distributions [49].Stochastic Gradient Descent (SGD) Based Approximations [50] use the iterates of averaged SGD as an MCMC sampler.Stochastic Weight Average Gaussian (SWAG) [40] for Bayesian model averaging and uncertainty estimation proves that the posterior distribution over neural network parameters is close to Gaussian in the subspace spanned by the trajectory of SGD.

III. METHOD
Our proposed Bayesian object detection inference framework is illustrated in Figure 2. One significant advantage of our method is that we can select which layers of the model to sample.Different levels of uncertainties can be represented by choosing the different numbers of Bayesian layers.The experimental results of choosing different layers to perform OOD detection are shown in section IV.

A. Bayesian Neural Networks for Object Detection
Followed by the definition of the data in section II, the goal of OOD detection for object detection is to provide predictions based on the training data While the straightforward idea is to infer the posterior distribution p(y|x, θ) based on the training dataset D during the training stage.Variational Inference (VI) implies fitting a Gaussian variational posterior approximation over the weights of neural networks, and then approximating the prior distribution through Kullback-Leibler (KL) divergence [35].VI is later generalized through the reparameterization trick for DNNs [34], [51].Same as other Bayesian approximation methods introduced in Section II, VI methods are empirically hard to be deployed on deep neural networks with numerous parameters.

B. Weights Sampling from the Bayesian Layers
Instead of training to approximate the posterior distribution p(y|x, θ), our key idea is that, given a pre-trained model, we assume the uncertainty representation of the weight parameters of different Bayesian layers from class-conditional multivariate Gaussian distributions: where µ θ is the Gaussian mean, Σ is the related covariance matrix, and θ l is weight parameters corresponding to different layers in the neural networks.The ability to model epistemic uncertainty can be achieved by sampling the weights from the proposed prior distribution.
To formulate the parameters of the proposed classconditional Gaussian distributions, we propose that the sample class means μθ are chosen as the values of the weight parameters θ pre from the pre-trained models.The true posterior weight distribution p(θ|D, ω), following data assimilation D, is proportional to the product of the prior weight distribution p(θ|ω) and the likelihood p(D|θ).Instead of approximation methods such as MCMC [52], a particular weight prior is chosen corresponding to the weight decay ω.Typically, weight decay is used to regularize DNNs, corresponding to explicit L2 regularization during the SGD [53] training process without momentum.When SGD is used with momentum, implicit regularization still exists, producing a vague prior on the weights of the deep neural network.This regularizer can be given an explicit Gaussian-like form [54], corresponding to a prior distribution on the weights.
Furthermore, we are sampling the weights θ l from the ϵlikelihood region of the estimated class-conditional distribution: where θl ∼ N ( μθ , Σ) denotes the proposed posterior distribution for layer l, which are in the sub-level set based on the likelihood.The hyperparameter ϵ is chosen to be sufficiently small as the sampled weight parameters will not be too far away from the original value causing the potential decrease of the performance of the model.

C. OOD Scoring Rule
An appropriate scoring rule S(p θ (x, b)) is required to distinguish the OOD from the ID data.Scoring rules can be further divided into local and non-local rules [55].As a local scoring rule, the negative Log Likelihood (NLL) function [56] evaluates a predictive distribution based on its value only at the true target.NLL measure the quality of predicted probability distributions of a test dataset by: NLL = − 1 Ntest Ntest n=1 log p(y n |x n , D), where x n is a test data point, and y n its corresponding ground truth label.NLL ranges in (−∞, +∞), with a lower NLL score indicating a better fitting predictive distribution for that specific ground truth label.However, NLL has its limitation in learning and evaluating bounding box predictive distributions.Energy Score [57] is a proper and non-local scoring rule as an alternative for learning and evaluating multivariate Gaussian predictive distributions.Energy score uses a non-probabilistic energy function to attribute lower values to in-distribution data and higher values to out-of-distribution data.Energy score has been proven efficient compared to the standard softmax score for OOD detection on both image classification and object detection tasks [24], [28].For object detection tasks, the object-level energy score function [58] can be defined as where the f k ((x, b); θ) is the logit output for class k in the classification head.T represents the temperature parameter of the Gibbs distribution [24].In this case, it is set as a constant hyperparameter.
During inference, given a test input x, the object detector produces a bounding box prediction b through the Bayesian model consists of sampled parameters θ l .The OOD uncertainty score for the predicted object (x * , b * ) is given by: ϕ(•) can be a trainable nonlinear function that allows flexible energy surface learning.In this case, ϕ(•) is set to be a hyperparameter for inference only.
For OOD detection, ID and OOD objects are distinguished through the following equation: where G(x, b) = 1 implies that the current object is ID, and G(x, b) = 0 means the OOD data.The threshold γ is typically chosen so that a high fraction of ID data (e.g., 95%) is correctly classified.

IV. EXPERIMENTAL RESULTS
In this section, we demonstrate the effectiveness of our proposed Bayesian inference method on a wide range of OOD benchmarks with different base models for both object detection and image classification tasks, respectively in Section IV-A and in Section IV-B.The visualization results are shown in Figure 3.
A. Evaluation on Object Detection a) Experimental Setup: To evaluate our method, we use PASCAL VOC1 [1] and Berkeley DeepDrive 100K (BDD-100k2 ) [59] datasets as the ID training data, MS-COCO [3] as the OOD data.PASCAL VOC 2007 and PASCAL VOC 2012 containing 21493 images for train/validation/test data are used.BDD-100k contains 70,000/10,000/20,000 images for train/val/test with 1.8M objects.COCO2017 includes 123,287 images and 886,284 instances.
The models are implemented based on the Detectron2 library [60].Faster-RCNN with ResNet-50 backbone [61] is used as the base framework for object detection.We choose sample numbers of 1000 for each single test run.The Bayesian layers W l are chosen to be the 2D Convolutional layers in the main results listed in Table I.Faster R-CNN model with a ResNet-50 backbone has a total of 52 Conv2D layers including 49 Conv2D layers from the ResNet-50 backbone and 3 Conv2D layers from the RPN and detection head.The impact of selecting different layers to be Bayesian is discussed in Section IV-A0d.
b) Evaluation Metrics: Two evaluation metrics are used to evaluate the performance of proposed models on OOD detection, which are: (1)FPR95, the false positive rate of OOD samples when the true positive rate of ID samples is at 95%; (2)AUROC, computes Area Under the Receiver Operating Characteristic Curve.In addition, mean average precision (mAP) is reported to represent the performance on the ID task.
c) Bayesian Layers are Effective on OOD Detection: Table I shows that in comparison to different pre-trained base models, algorithms with Bayesian layers can provide varying degrees of performance enhancement.Our proposed method is compared with several baselines introduced in Section II, including Faster-RCNN (Baseline) [2], MC Dropout [42], SWAG [40], BayesOD [47], Energy score [24], and VOS [28].Several popular Bayesian deep learning methods often used as benchmarks are included, as well as outperforming OOD detection algorithms as comparisons to prove the efficiency of inference with Bayesian layers on OOD detection.The comparison precisely highlights the benefits of incorporating Bayesian layers for OOD inference.The experimental results show that all baselines inferred by Bayesian layers have increased OOD detection performance.Results shown in Table III are produced based on the convolutional layers from the backbone and the detection head chosen to be transformed into Bayesian layers.
Compared to each base model, inference with Bayesian layers increases the performance for both experiments with two ID datasets.Some baselines rely on a classification model trained primarily for the ID classification task.Due to the existence of a classification head, these methods can be naturally extended to the object detection model.MC Dropout and SWAG are similar Bayesian methods to our method which is also scalable to deep learning models and  Most existing OOD detection methods focus on modeling a single type of uncertainty.The epistemic uncertainty comes from the classification branch modeling the classifier or the regression branch modeling the bounding box regression.Other methods, such as VOS, rely on synthesis data outliers to measure the aleatoric uncertainty.Inference Bayesian layers makes it possible to quantify the uncertainties from the classification head and the regression head at the same time by choosing to transform Bayesian Layers from backbone and detection heads together.This provides a theoretical explanation of the performance improvement of Bayesian layers being applied to all baselines.Inference with Bayesian layers offers a tool to model the epistemic uncertainty coming out of every corner of the network architecture.Combined with the aleatoric uncertainties modeling allows further performance improvement for OOD detection.II shows the mean FPR95 and AUROC results of 3 runs.Specifically, we consider two types of neural network layers: (i) Convolutional Layer: As the primary building block of a CNN.The convolutional layer computes the convolutional operation of the input images using kernel filters to extract fundamental features.The model captures epistemic uncertainties when meeting the OOD data by replacing deterministic weight parameters with the proposed Gaussian distributions.(ii) Linear Layer: A linear layer, also known as a fully connected layer.It is capable of learning an average rate of correlation between the output and the input without a bias.Linear layers are frequently modified by Bayesian methods since they contain limited parameters leading to less training and computational inference cost.

B. Evaluation on Image Classification
We demonstrate that Bayesian Layers are also effective at common image classification benchmarks beyond object detection.In this case, CIFAR-10 [62] with standard train/val splits is used as the ID training data.WideResNet-40 [63] is chosen to be the base neural network architecture.For image classification, the cross-entropy loss is used as the training objective.All models are evaluated on six OOD datasets:Textures [64], SVHN [65], Places365 [66], LSUN-C [67], LSUN-Resize [67], and iSUN [68].The comparisons are shown in Table III

V. CONCLUSION
In this work, we propose a novel and scalable Bayesian deep learning inference framework for OOD detection.Transforming the deterministic deep neural layers into Bayesian layers by replacing the weights parameters trained by point estimates with assumed Gaussian distributions is able to provide the uncertainty estimation during the inference stage.We show that inference with superficial Bayesian layers can improve the performance of various benchmarks for OOD detection.Unlike other Bayesian deep learning methods, our proposed method does not require extra training, making it scalable for modern advanced deep-learning-based object detectors.For future work, we would like to explore the impact on different assumed distributions for Bayesian layers and the uncertainty quantification method for Bayesian layers.In addition, our approach can be valuable to other machine learning tasks than object detection or image classification.We hope future research will increase attention toward a broader view of OOD uncertainty estimation from a Bayesian layers perspective.

Figure 2 :
Figure 2: The framework of our proposed Bayesian inference method.The base model is first trained on the ID dataset with the joint object detection loss (L reg , L cls ).During the inference stage, we assume the distributions of weight parameters from the chosen layers as class-conditional Gaussians, and sample the results from the low-likelihood region given by the OOD data.The uncertainty estimation produced by the Bayesian layers coming out of the backbone, the classification head or the regression head provides a high uncertainty score S for the OOD data.Combined with the OOD scoring rule with the uncertainty modelling by the Bayesian layers, the OOD data can be separated from the ID data.
where x ∈ X denotes the input image, b ∈ R 4 represents the bounding box coordinates associated with object instances in the image, and y ∈ Y is semantic labels to its corresponding target for K classes classification.In stead of forming the parameters through maximum a-posterior (MAP) optimization θMAP = argmax θ p(θ|D), BNNs aim to produce the predictive classification distribution p θ (ŷ|x) and the bounding box regression distribution p θ ( b|ŷ, x) through a Monte Carlo sampling procedure: p(ŷ|D, x) ≈ 1 T T t=1 p(ŷ|θ l , x) , θ l ∼ p(θ|D), where the ŷ denotes the output predictions of the test input data x, and b.

Figure 3 :
Figure 3: The visualization results of OOD detection.Images are taken from COCO2017.The first row of visualization results is provided by a vanilla Faster-RCNN.The bottom row of visualization results is produced when inference with Bayesian Layers.Models are trained on PASCAL-VOC and evaluated on COCO.Detected objects with blue bounding boxes are classified as one of the ID classes, while objects detected in green bounding boxes indicate that they are recognized as OOD objects.
d) Impact of Bayesian Layers Selection:To further investigate the impact of choosing different layers to be transformed into Bayesian layers during the Bayesian inference, we test five different architectures on COCO as the OOD dataset when trained on BDD-100k as ID data.Table

Table I :
) Main results of the comparison between the proposed method and competitive out-of-distribution detection methods.All methods are implemented based on Faster-RCNN with ResNet-50 as the backbone, and then evaluated on COCO2017 as the OOD dataset.↑ indicates larger values are better and ↓ indicates smaller values are better.All values are percentages.Bold numbers are superior results.We report standard deviations estimated across 5 runs.

Table III :
, with results averaged over six test datasets.We demonstrate that inferencing with Bayesian layers for OOD detection has different levels of performance improvement compared to each benchmark, without sacrificing the ID test classification accuracy (94.84% on pretrained WideResNet) and extra training requirement.OOD detection results of the proposed Bayesian methods on image classification.All models listed are built based on the WideResNet-40 architecture.All models are first trained on CIFAR-10 dataset and then evaluated on several OOD datasets.The mean PR95 and AUROC results are given.