Robust Uncertainty Quantification Using Conformalised Monte Carlo Prediction

Deploying deep learning models in safety-critical applications remains a very challenging task, mandating the provision of assurances for the dependable operation of these models. Uncertainty quantification (UQ) methods estimate the model's confidence per prediction, informing decision-making by considering the effect of randomness and model misspecification. Despite advances in state-of-the-art UQ methods, they are computationally expensive or produce conservative prediction sets/intervals. We introduce MC-CP, a novel hybrid UQ method that combines a new adaptive Monte Carlo (MC) dropout method with conformal prediction (CP). MC-CP adaptively modulates the traditional MC dropout at runtime to save memory and computation resources, enabling predictions to be consumed by CP, yielding robust prediction sets/intervals. Through comprehensive experiments, we show that MC-CP delivers significant improvements over advanced UQ methods, like MC dropout, RAPS and CQR, in both classification and regression benchmarks. MC-CP can be easily added to existing models, making its deployment simple.


Introduction
Advances in Deep Learning (DL) enable its employment in diverse and challenging tasks, including speech recognition (Kumar et al. 2020) and image annotation (Barnard et al. 2003). Despite its numerous potential applications, using DL in safety-critical applications (e.g., medical imaging/diagnosis) mandates ensuring its dependable and robust operation (Pereira and Thomas 2020; Gerasimou et al. 2020). Uncertainty quantification (UQ) is crucial in assessing a DL model's confidence for input-prediction pairs and establishing the potential impact of noisy, sparse, or low-quality input and misspecification in DL models (Kendall, Badrinarayanan, and Cipolla 2016). Ultimately, UQ enables understanding situations where the model is particularly uncertain, instrumenting uncertainty-aware decision-making (Calinescu et al. 2018).
DL-focused methods for UQ aim at assessing model and data uncertainty of DL models (Abdar et al. 2021). In particular, Monte Carlo (MC) dropout (Gal and Ghahramani 2016) elegantly quantifies uncertainty within DL models by outputting the standard deviation of predictions from an ensemble of networks using dropout layers. However, running numerous forward passes is computationally expensive. Similarly, Bayesian Neural Networks (BNNs) (MacKay 1992) constitute a more natural UQ method that can estimate both epistemic and aleatoric uncertainty. However, BNNs are computationally intensive both during training and inference and require substantial fine-tuning. Finally, conformal prediction (CP) (Vovk, Gammerman, and Shafer 2005) produces prediction sets/intervals instead of singletons. The larger the set/interval, the more unsure the model is about its prediction, with a singleton prediction/narrow interval typically signifying large confidence. Despite their merits, CP methods are over-conservative, producing larger sets/intervals than necessary (Fan, Ge, and Mukherjee 2023).
Driven by these advances, we introduce Monte Carlo-Conformal Prediction (MC-CP), a novel hybrid method that comprises adaptive MC dropout and conformal prediction techniques, inheriting both the statistical efficiency of the former and the distribution-free coverage guarantee of the latter. MC-CP dynamically adapts the conventional MC dropout with a convergence assessment, saving memory and computational resources during inference where possible. The predictions are then consumed by advanced CP techniques to synthesize robust prediction sets/intervals. Our experimental evaluation shows that the hybrid MC-CP approach overestimates prediction set/interval sizes less than regular CP methods. Despite its simplicity, it outperforms state-of-the-art CP- and MC-based methods, e.g., traditional MC dropout, RAPS (Angelopoulos et al. 2022) and CQR (Romano, Patterson, and Candes 2019), in both classification and regression benchmarks. While RAPS and CQR quantify uncertainty by increasing the prediction set/interval size, MC-CP does this and also outputs an exact quantification in the form of variance in the prediction distribution. Our MC-CP method is designed to be applied at inference time, in contrast to evidential deep learning and Bayesian neural networks. Whilst these methods provide salient and informative UQ estimations, MC-CP is realised post-training.
Our contributions are:
• An adaptive MC dropout method that can save computational resources compared to the original method;

Related Work
Uncertainty Quantification (UQ) in DL indicates how uncertain a model is about its predictions. The most common uncertainty types are aleatoric and epistemic. The former concerns the irreducible uncertainty within data (e.g., random noise). The latter reflects the model's lack of knowledge or poor training, which can be reduced with more data or better training. MC-CP focuses on quantifying epistemic uncertainty.
Deep ensembles are a straightforward method to quantify uncertainty in DL (Lakshminarayanan, Pritzel, and Blundell 2017). The method involves training an ensemble of networks with the same or similar architecture, initialised with different weights. After training, the ensemble predicts on the same input data, using the mean of the predictions as the final prediction and the variance as the uncertainty.
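The ensemble aggregation just described can be sketched in a few lines (an illustrative example, not the paper's code; the member outputs below are made up):

```python
import numpy as np

def ensemble_predict(member_outputs):
    """member_outputs: array of shape (n_members, n_classes) holding one
    softmax vector per ensemble member for a single input."""
    member_outputs = np.asarray(member_outputs)
    mean = member_outputs.mean(axis=0)  # final prediction
    var = member_outputs.var(axis=0)    # per-class uncertainty
    return mean, var

# Three hypothetical members voting on a two-class problem.
mean, var = ensemble_predict([[0.7, 0.3], [0.6, 0.4], [0.8, 0.2]])
```

The per-class variance directly exposes where the members disagree, which is the uncertainty signal deep ensembles provide.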
Monte Carlo (MC) dropout (Gal and Ghahramani 2016) is a simple and effective method to compute epistemic uncertainty in DL models by exploiting dropout (Srivastava et al. 2014), a regularization technique that randomly drops units of the neural network to prevent reliance on certain weights. Although dropout is typically used only during training, MC dropout keeps this feature active during inference and performs several forward passes to devise a prediction distribution. The final prediction is the mean of the distribution, and the variance signifies the uncertainty. Gaussian dropout (Kingma, Salimans, and Welling 2015) complements regular dropout by adding noise using a Gaussian distribution instead of setting the unit's value to zero.
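MC dropout inference can be sketched as follows (a minimal, framework-free illustration with a toy one-layer network and made-up weights, not the paper's implementation; the key point is that the dropout mask stays active at inference):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 8))  # toy weights for a tiny "network"
W2 = rng.normal(size=(8, 3))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward_with_dropout(x, p=0.5):
    h = np.maximum(W1.T @ x, 0.0)   # ReLU hidden layer
    mask = rng.random(h.shape) > p  # dropout kept ON at inference time
    h = h * mask / (1 - p)          # inverted-dropout scaling
    return softmax(W2.T @ h)

def mc_dropout_predict(x, K=100):
    # K stochastic forward passes -> a distribution of predictions.
    preds = np.stack([forward_with_dropout(x) for _ in range(K)])
    return preds.mean(axis=0), preds.var(axis=0)

mu, sigma2 = mc_dropout_predict(np.ones(4))
```

Each pass samples a different sub-network, so the spread of the resulting predictions estimates epistemic uncertainty.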
Bayesian Neural Networks (BNNs) (Kendall, Badrinarayanan, and Cipolla 2016) realise UQ directly in the model's architecture. While in traditional DL networks weights are singleton variables, in BNNs weights are represented as distributions. Although BNNs produce probabilistic predictions that naturally capture uncertainty, they are computationally intensive, require substantially more training than standard networks, and typically resort to approximate Bayesian computation techniques like variational inference.
Conformal prediction (CP) (Vovk, Gammerman, and Shafer 2005) is a framework that uses validity to quantify a model's prediction confidence. Validity encodes that, on average, a model's predictions will be correct within a guaranteed confidence level (e.g., 90% of the time). The method then alters the prediction from a singleton/point to a set/interval that indicates the confidence level of the model. The larger the set/interval, the more uncertain the model is, and vice versa. CP involves splitting the test data into two sets: a calibration set and a validation set. The calibration set is used to estimate the thresholds needed to achieve the desired confidence levels. CP has been applied to a diverse set of applications (e.g., image classification (Angelopoulos et al. 2022), regression (Romano, Patterson, and Candes 2019), object detection (de Grancey et al. 2022)).
An orthogonal method is test-time augmentation (Wang et al. 2019; Moshkov et al. 2020), which alters the data at inference time instead of the model or predictions. Given an input, the method creates multiple augmented inputs using various augmentation techniques. The DL model then makes predictions for the augmented inputs; their distribution and variance represent the model's uncertainty. Data augmentation using generative AI has also been proposed to enhance the inference capabilities of DL models (Missaoui, Gerasimou, and Matragkas 2023).

Preliminaries
Given a coverage level α ∈ (0, 1), signifying the guarantee that the true label/point is in the prediction set/interval with probability 1 − α, conformal prediction (CP) constructs a prediction set/interval instead of a singleton/point. To achieve this, CP splits the test dataset into a calibration set c and a validation set v. Next, conformal scores s(f(x_i), y_i) ∈ R are calculated for each (x_i, y_i) ∈ c. This score is high when the model f(·) produces a low softmax output for the true class, i.e., when the model is very wrong. A quantile threshold q̂ is then calculated as the ⌈(n+1)(1−α)⌉/n empirical quantile of the conformal scores, using the desired coverage α, the calibration set c, and the size n of the calibration set. This threshold is used to form prediction sets C(x_j) = {y : f(x_j)_y ≥ 1 − q̂} for each new input x_j (e.g., from the validation set v). For quantile regression, prediction intervals are formed by C(x_j) = [t_{α/2}(x_j) − q̂, t_{1−α/2}(x_j) + q̂], where t_{α/2} and t_{1−α/2} are the α-informed quantiles produced by the trained model.
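The score and quantile computation above can be sketched for classification as follows (illustrative code with made-up calibration data, not the paper's implementation; `method="higher"` picks the conservative empirical quantile):

```python
import numpy as np

def conformal_threshold(cal_smx, cal_y, alpha=0.1):
    """cal_smx: softmax outputs on the calibration set, shape (n, n_classes);
    cal_y: true labels. Returns the quantile threshold q_hat."""
    n = len(cal_y)
    scores = 1.0 - cal_smx[np.arange(n), cal_y]   # high when model is wrong
    q_level = np.ceil((n + 1) * (1 - alpha)) / n  # finite-sample correction
    return np.quantile(scores, min(q_level, 1.0), method="higher")

def prediction_set(smx, q_hat):
    # Classes whose softmax mass clears the calibrated threshold.
    return np.where(smx >= 1.0 - q_hat)[0]

cal_smx = np.array([[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]])
cal_y = np.array([0, 0, 1, 0])
q_hat = conformal_threshold(cal_smx, cal_y, alpha=0.5)
```

Any new input's prediction set then follows from `prediction_set(softmax_output, q_hat)`.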
Coverage is a key metric for assessing CP, measuring how often the predicted set/interval contains the ground truth; it is expected to reflect the desired coverage property 1 − α. Given model f, coverage is calculated by

coverage = (1/n) Σ_{i=1}^{n} 1[y_i ∈ f_α(x_i)]

where α is the user-defined coverage level, n is the size of the validation set, y_i is the true label/value, and f_α(x_i) is the prediction set/interval made by the model for input x_i. This equation reflects the percentage of true labels/values captured by the respective prediction sets/intervals.
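Both coverage and the companion efficiency metric (the average set size) can be computed directly from the prediction sets; a minimal sketch with made-up sets and labels:

```python
import numpy as np

def coverage(pred_sets, y_true):
    """Fraction of validation points whose true label is in the set."""
    return np.mean([y in s for s, y in zip(pred_sets, y_true)])

def efficiency(pred_sets):
    """Average prediction-set size: smaller is better at fixed coverage."""
    return np.mean([len(s) for s in pred_sets])

sets = [{0}, {0, 1}, {2}, {1, 2}]
labels = [0, 1, 2, 0]
cov = coverage(sets, labels)  # 3 of 4 sets contain the truth -> 0.75
```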
Efficiency is another important CP metric. While including all possible classes in a prediction set would, by default, yield a perfect accuracy score, it is impractical. Thus, a DL model that achieves the desired coverage efficiently is preferred. Efficiency is calculated as the average size of the set/interval, given by

efficiency = (1/n) Σ_{i=1}^{n} |f_α(x_i)|

MC-CP

MC-CP combines adaptive MC dropout and conformal prediction, leveraging their low computational cost and finite-sample distribution-free coverage guarantees, respectively. Fig. 1 shows a high-level overview of our MC-CP method for image classification. We discuss next adaptive MC dropout, followed by an exposition of MC-CP for classification and regression. This novel combination of adaptive MC dropout and CP, albeit straightforward, results in a hybrid MC-CP method that yields significant improvements compared to state-of-the-art UQ techniques (Section ).

Adaptive Monte Carlo Dropout
The competitive predictive performance of MC dropout largely depends on the execution of multiple stochastic forward passes of each input through the DL model at inference time. The number of forward passes K the model should perform is defined a priori and is fixed. Since, for any new input, the dropout layers of the DL model are kept on during inference, the ensemble of these K forward passes produces a distribution of predictions. This distribution enables quantifying uncertainty by computing metrics such as the expected (average) value, standard deviation and entropy.
The motivation underpinning adaptive MC dropout originates from the observation that each forward pass corresponds to a particular DL model instantiation that adds unique variance to the prediction distribution. Some of these DL model instantiations, informed by MC dropout forward passes, can produce similar or even identical predictions. Hence, although the prediction variance might be large initially, as the number of forward passes increases, the variance value becomes smaller, indicating that the inference process has converged. If the current number of forward passes is substantially less than the maximum number of forward passes K when this event occurs, the remaining forward passes incur only additional overheads but add little to no value. Adaptive MC dropout leverages this observation to reduce the number of wasted forward passes once convergence is diagnosed, thus yielding significant computational savings without impacting the prediction effectiveness.
Algorithm 1 shows our adaptive MC dropout method. Given a new input x, the method performs up to K forward passes over model f to produce the predictive posterior mean as the final prediction and the variance of the predictive posterior as the prediction uncertainty. Unlike conventional MC dropout, our algorithm uses the hyperparameters threshold δ and patience P to detect convergence and terminate early. The threshold δ denotes the maximum difference in variance required to signal that the class/quantile prediction has likely converged. The patience P signifies the number of consecutive forward passes in which all classes/quantiles must stay below δ before execution stops early. The criterion of performing P successive forward passes that meet the threshold δ is important in determining convergence and mitigating the potential effect of randomness.
Adaptive MC dropout works as follows. While the current forward pass counter is less than K and the current patience counter is less than P (line 3), the model predicts the input data with dropout layers switched on (line 4). The prediction is added to a list, and the variance of that list is estimated (lines 5-6). From the second forward pass onward, the difference between the current variance σ and the last estimated variance σ_{i−1} is calculated (line 8). If the difference for all classes/quantiles is below the threshold δ, the current patience counter is increased (lines 9-10); otherwise, it is reset (line 12). Once all classes/quantiles converge below δ after P consecutive forward passes, the predictive posterior mean and variance are output as the prediction and its measured uncertainty, respectively (line 14).
The user-defined parameters threshold δ ∈ (0, 1) and patience P ∈ Z+ enable controlling the sensitivity of adaptive MC dropout to changes in prediction variance. When δ approaches 1, our method becomes less sensitive, allowing execution to stop earlier. In contrast, the closer δ is to 0, the more sensitive it becomes, requiring the execution of more forward passes until convergence is diagnosed. It can be easily seen that selecting a small δ and a large patience P instruments the conventional MC dropout method. We demonstrate this remark later in Tables 4 and 6.
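The adaptive procedure can be sketched as follows (our reading of Algorithm 1, with a made-up `stochastic_forward` noise model standing in for a dropout-enabled network):

```python
import numpy as np

rng = np.random.default_rng(1)

def stochastic_forward(x):
    # Toy stand-in for a dropout-enabled forward pass: noisy logits -> softmax.
    logits = x + rng.normal(scale=0.1, size=x.shape)
    e = np.exp(logits - logits.max())
    return e / e.sum()

def adaptive_mc_dropout(x, K=1000, delta=1e-4, patience=10):
    preds, count, prev_var = [], 0, None
    while count < patience and len(preds) < K:   # line 3 of Algorithm 1
        preds.append(stochastic_forward(x))      # lines 4-5
        var = np.var(preds, axis=0)              # line 6
        if prev_var is not None:                 # from the 2nd pass onward
            diff = np.abs(var - prev_var)        # line 8
            # Increment patience only if ALL classes changed less than delta;
            # a single spike resets the counter (lines 9-12).
            count = count + 1 if np.all(diff < delta) else 0
        prev_var = var
    preds = np.stack(preds)
    return preds.mean(axis=0), preds.var(axis=0), len(preds)

mu, sigma2, passes = adaptive_mc_dropout(np.array([2.0, 1.0, 0.5]))
```

With a large `delta` the loop stops almost immediately; with a tiny `delta` and large `patience` it degenerates to conventional MC dropout with K passes, matching the remark above.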
We also provide a sketch of the proof for the adaptive MC dropout method. The MC dropout process is a Bernoulli process: each MC dropout forward pass is independent of the others, and the model parameters are fixed during our adaptive MC dropout approach. According to the Law of Large Numbers, as the number of Predictions from line 5 of Algorithm 1 increases, the sample variance σ from line 6 converges to the true variance σ_true of the MC dropout output population, and there exists a number of forward passes N = #Predictions such that for all i ≥ N, |σ − σ_true| < δ/2. We show that the while loop of lines 3-13 terminates after fewer than K iterations if N < K − P. To that end, we note that, since the σ value computed in iterations N, N + 1, . . ., N + P of the while loop is within δ/2 of σ_true, in each of these successive iterations diff = |σ_{i−1} − σ| < δ in line 8, and therefore Count is incremented in line 10, reaching the value P and ending the while loop before K iterations.

In the conformal score of Algorithm 2 (step 3), k′ is the model's ranking of the true class y_j and π_(i)(x_j) is the i-th largest score for the j-th image; step 4 (find the threshold) assigns the quantile threshold to the 1 − α quantile of the scores E_j.

Conformal Prediction
1: Mean softmax: retrieve the softmax outputs and variance from AdaptiveMonteCarloDropout(f, v, K, δ, P).
2: Prediction set: output the k* highest-score classes.

MC-CP for Image Classification
For image classification, we combine our adaptive Monte Carlo dropout method with conformal prediction to form MC-CP, shown in Algorithm 2. MC-CP is split into two steps: conformal calibration and prediction. First, a test dataset is split into calibration and validation sets. Platt scaling is then performed on the pre-trained model using the calibration set. Next, we calculate the conformal scores for each input image in the calibration set, which are then used to calculate the quantile threshold q̂.
During the prediction stage of MC-CP, we invoke the adaptive MC dropout method, with the selected hyperparameters, for each new input image. This invocation returns the mean prediction and variance over the possible classes of the image. The final prediction set is then determined by calculating the cumulative softmax output for all classes and including the classes, from most to least likely, that do not exceed the quantile threshold. In Section , we show how MC-CP outperforms other state-of-the-art conformal prediction techniques, with modest computational overheads.
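The set construction described above can be sketched as follows (illustrative, not the paper's code; one common convention, assumed here, is to include the first class whose cumulative mass crosses the calibrated threshold `q_hat`):

```python
import numpy as np

def cumulative_prediction_set(mean_smx, q_hat):
    """mean_smx: mean softmax from adaptive MC dropout for one input.
    Include classes from most to least likely until the cumulative
    probability mass reaches the calibrated threshold q_hat."""
    order = np.argsort(mean_smx)[::-1]               # most likely first
    cumulative = np.cumsum(mean_smx[order])
    k_star = np.searchsorted(cumulative, q_hat) + 1  # smallest covering k
    return order[:k_star]

mean_smx = np.array([0.05, 0.60, 0.25, 0.10])  # made-up mean softmax
pred_set = cumulative_prediction_set(mean_smx, q_hat=0.8)
```

A confident (peaked) mean softmax yields a singleton; a flat one grows the set, which is exactly the behaviour the evaluation section examines.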

Conformal Prediction
1: Mean softmax: retrieve the softmax outputs and variance from AdaptiveMonteCarloDropout(f, v, K, δ, P).
2: Prediction interval: output the prediction interval for unseen validation data v.

MC-CP for Regression
We also develop an extension of MC-CP for deep quantile regression, shown in Algorithm 3. This is also split into calibration and prediction steps. To calculate the conformal scores, the magnitude of error for the desired quantiles is estimated. Next, the threshold is calculated using the calibration dataset.
For the prediction stage of MC-CP for deep quantile regression, once again, the adaptive MC dropout method is called, with the desired hyperparameters, for each data point in the validation dataset. Finally, a prediction interval is calculated for both quantiles on an unseen data point in the validation set using the calculated threshold. In Section , we show how MC-CP outperforms regular deep quantile regression and the CQR method.
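The calibration and interval construction for regression can be sketched CQR-style as follows (an assumed form consistent with the preliminaries, with made-up calibration data; the model's t_{α/2}/t_{1−α/2} quantile predictions appear as `lo`/`hi`):

```python
import numpy as np

def cqr_threshold(lo_cal, hi_cal, y_cal, alpha=0.1):
    """Conformal score: how far each calibration point falls outside the
    predicted [lo, hi] quantile band (negative when inside)."""
    scores = np.maximum(lo_cal - y_cal, y_cal - hi_cal)
    n = len(y_cal)
    q_level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(scores, q_level, method="higher")

def conformal_interval(lo, hi, q_hat):
    # Widen (or shrink, if q_hat < 0) the predicted band by the threshold.
    return lo - q_hat, hi + q_hat

lo_cal = np.array([0.0, 1.0, 2.0, 3.0])  # made-up lower-quantile predictions
hi_cal = np.array([2.0, 3.0, 4.0, 5.0])  # made-up upper-quantile predictions
y_cal = np.array([1.0, 3.5, 1.5, 4.0])   # made-up true values
q_hat = cqr_threshold(lo_cal, hi_cal, y_cal, alpha=0.5)
lo, hi = conformal_interval(1.0, 2.0, q_hat)
```

In MC-CP, `lo` and `hi` would come from the mean quantile predictions returned by adaptive MC dropout rather than a single forward pass.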
For regression, we use the following five benchmarks: Boston Housing (Harrison and Rubinfield 1978), Abalone (Nash et al. 1995), Blog Feedback, Concrete Strength, and Protein.

Image Classification Results
Classification Accuracy. The accuracy results of five different methods against each of the datasets are shown in Table 1. The methods tested against MC-CP were a baseline CNN, the same CNN with MC dropout applied, Naive conformal prediction (Angelopoulos and Bates 2022), and RAPS. Results show that not only does MC-CP have increased accuracy in comparison to baseline and state-of-the-art conformal prediction methods, but it also does so with less deviation between runs. In particular, we emphasise that our method consistently increases accuracy and yields a lower standard deviation on difficult datasets such as CIFAR-10, CIFAR-100 and Tiny ImageNet. Further, and as expected, conformal prediction methods can drastically improve accuracy compared to baseline methods, such as a regular CNN and MC dropout. However, MC-CP improves accuracy substantially with less deviation between runs, highlighting its consistency compared with Naive CP and RAPS.

Singleton and Mixed Predictions. Next, we compare the percentage and accuracy of singleton and non-singleton (mixed) predictions for all three conformal prediction methods on CIFAR-10 (Figure 2). Naive CP is more likely to predict singleton values, whereas our method is least likely. When a model is not confident about its prediction, CP-based methods should desirably increase the prediction set size to account for this uncertainty and, hopefully, include the correct class in the larger prediction set. The comparison of singleton and non-singleton results in Figure 2 provides evidence that our method correctly increases the set size to improve accuracy. In fact, for both singleton and non-singleton set sizes, our method performs with the highest accuracy, also exhibiting consistent behaviour, as indicated by the low variance between runs.
An argument can be made that making the set size large enough could cover nearly all the classes, and this behaviour could reflect a higher accuracy. Comparing these results with the mean set sizes in Table 1, we can see that all methods only cover a portion of the classes in their mean set sizes.
Confidence of Predictions. We evaluated whether MC-CP could result in a more confident model than traditional conformal prediction methods, thus providing improved accuracy. Figure 3 shows the mean highest softmax output for every CP method on CIFAR-10, CIFAR-100, and Tiny ImageNet. Compared to Naive CP and RAPS, our method shows an increase in confidence across all benchmarks. Looking closely at larger-scale datasets, such as CIFAR-100 and Tiny ImageNet, MC-CP is substantially more confident in its predictions. We also observe, in Figure 3, that MC-CP consistently has a smaller standard deviation between runs than the other methods.
Prediction Set Sizes. We have already shown how the accuracy of each method scales using CIFAR-100. However, this only reflects a portion of the performance of each method at scale and does not highlight any of its weaknesses. The 'Prediction Sizes' column in Table 1 shows the mean set size and variance for Naive CP, RAPS, and MC-CP on the five datasets. The results on CIFAR-10 show that Naive CP has the smallest mean set size, but this does not reflect its accuracy. Looking at the CIFAR-100 results, we can see that Naive CP again has the smallest mean, but its variance is substantially larger than the other results. In fact, we observed that Naive CP had set sizes ranging from 1 to 86, which indicates that the method cannot cope effectively with large-scale datasets with many (potential) classes. For both datasets, MC-CP achieves a smaller mean than RAPS and has less deviation around the mean. For CIFAR-100, RAPS has set sizes ranging from 33 to 59, whereas MC-CP has set sizes ranging from 30 to 52. These results show how MC-CP can boost confidence in conformal prediction algorithms and achieve better results. Overall, we observe that advanced CP algorithms, like RAPS, tend to overestimate their predictions, and MC-CP reduces this overestimation.
We also demonstrate that our MC-CP method works well with models at scale by assessing its capabilities using the VGG16 and VGG19 models on the Tiny ImageNet dataset. Table 2 shows the reduced prediction set sizes for these models. The results on the larger DL models align with those shown in Table 1, except at smaller magnitudes.

Accuracy of Classes. We next validated that MC-CP was not just doing significantly better than other methods in one or two classes, but that it indeed performs better for nearly all classes. Table 3 shows the mean accuracy of all methods for each class in the CIFAR-10 dataset. We again see the trend where MC-CP increases the accuracy in comparison to the other methods, and the deviation between runs is also reduced. MC-CP consistently achieves an accuracy of approximately 97-99%, showing that it improves general accuracy, not just that of a few classes. The Frog class is the sole case where Naive CP achieves a higher accuracy, but this appears to be an outlier for that model; MC-CP still achieves a high mean accuracy of 99.02% ± 0.59.

Adaptive MC Dropout. Figure 4 shows the convergence of each class's variance for an example image from the CIFAR-10 dataset. We observe that at approximately 200 forward passes, the variance difference of all classes is below the δ threshold, and the patience counter starts increasing with every new iteration. However, at approximately 205 forward passes, the variance difference for classes Ship and Automobile spikes above the threshold; this is due to the stochastic nature of MC dropout. After 246 forward passes, all classes drop below the threshold, and the MC-CP procedure finishes early ten iterations later.
We also performed a sensitivity analysis of adaptive MC dropout to assess the impact of the threshold δ and patience P on its performance. Table 4 shows the various combinations of δ and P values used in these experiments. As P increases and δ decreases (from top left to bottom right), we notice an increase in the mean number of forward passes, yielding a corresponding reduction in test error (i.e., accuracy increase) and prediction set size. As expected, for δ = 0.00001, P = 100 (bottom right) we obtain the traditional MC dropout, where the number of forward passes equals K = 1000.
Finally, we demonstrate that adaptive MC dropout can save resources by comparing its execution overheads against traditional MC dropout for K = 1000, δ = 5e-4, P = 10. Traditional MC dropout performed all 1000 forward passes on CIFAR-10, and each image inference took an average of 35.52 ± 0.42 seconds. Adaptive MC dropout averaged 500.21 ± 196.37 forward passes over all images and took an average of 17.99 ± 7.09 seconds. The ability of our method to diagnose convergence led to ≈50% faster execution, meaning that the other ≈500 forward passes were not needed. Considering memory consumption, as expected, both methods use the same memory (≈1.07 GB / ≈1.08 GB for regular/adaptive MC dropout) when training a full model plus performing inference on a dataset.

Regression Results
Regression Accuracy and Coverage. In deep quantile regression, the mean absolute error (MAE) provides the magnitude of errors between the predicted quantiles and the true quantiles. Since MAE is less sensitive to outliers, we use it instead of (root) mean squared error. We also compute the empirical coverage, which measures how often the predicted quantiles contain the true statistical quantile. Similarly to image classification, the objective is for the posterior prediction set to contain the true quantile. Table 5 shows the MAE and empirical coverage for four different methods on the Boston Housing, Abalone, Blog Feedback, Concrete Strength and Protein datasets. We evaluated MC-CP against a baseline deep quantile regressor, the same deep quantile regressor with MC dropout, and conformalized quantile regression (CQR), the state-of-the-art CP regression method.
Looking at MAE, the traditional deep quantile regression model performs best across the five datasets. However, it also has very low empirical coverage across all five datasets. For example, in the Boston Housing dataset, the true data points are included in the predicted quantiles only 22% of the time. Similarly, although MC dropout increases the coverage by a considerable amount across all datasets, it consistently leads to a worse MAE overall. In fact, we observe a trade-off between the two metrics: a low MAE comes with low coverage, whereas high coverage induces a high MAE.
Considering the CP-based methods, we observe that CQR provides the 1 − α coverage guarantee specified for all datasets, i.e., approximately 90%. Furthermore, CQR achieves this coverage with an MAE comparable to the baseline method in our experiments. Our MC-CP method reaches the highest empirical coverage across all five datasets, but it does this with a slightly higher overall MAE (but lower standard deviation) on average than CQR. Given, however, the improved empirical coverage of MC-CP and its very close MAE results, we conclude that MC-CP delivers very competitive results against the state-of-the-art CP method for regression. This is a particularly important insight, especially in safety-critical applications where higher coverage is vital. We conclude our evaluation with Figure 5, which shows the predicted quantiles and coverage of the true values on an excerpt of the Boston Housing dataset. As expected, MC-CP yields slightly larger quantiles than CQR but has higher empirical coverage and misses fewer points.
Figure 5: Predicted quantiles (95%, 5%) of all four methods on a sample of the Boston Housing dataset.
Adaptive MC Dropout for Regression. Similarly to Table 4, we performed a sensitivity analysis on various combinations of δ and the patience value P for deep quantile regression. Table 6 shows how different combinations affect MAE and coverage. We also visualised the quantiles for the various combinations in Figure 6. Similarly to the results shown in Table 4, a small δ and a large patience yield results comparable to traditional MC dropout. It can be seen that with δ = 1e-5, P = 10, we save considerable computational time with an MAE comparable to δ = 1e-5, P = 100.
Similarly to the computational overheads investigation performed for image classification, we evaluated the overheads of traditional MC dropout against adaptive MC dropout with the same parameters. Traditional MC dropout performed all 1000 forward passes on the Boston Housing dataset, and each inference took an average of 34.08 ± 1.51 seconds. Adaptive MC dropout averaged 502.58 ± 56.94 forward passes over all data points and took an average of 16.58 ± 2.91 seconds. Accordingly, we obtained evidence that adaptive MC dropout was again ≈50% faster.

Conclusion and Future Work
Quantifying uncertainty in Deep Learning models is vital, especially when they are deployed in safety-critical applications. We introduced MC-CP, a hybrid uncertainty quantification method that combines a novel adaptive Monte Carlo dropout, informed by a convergence criterion to save resources during inference, with conformal prediction. MC-CP delivers robust prediction sets/intervals by exploiting the statistical efficiency of MC dropout and the distribution-free coverage guarantees of conformal prediction. Our evaluation on classification and regression benchmarks showed that MC-CP offers significant improvements over advanced methods, like MC dropout, RAPS and CQR. Our future work includes: (i) enhancing MC-CP to support object detection and segmentation tasks; (ii) performing a more extensive evaluation using larger benchmarks and DL models; and (iii) extending MC-CP to encode risk-related aspects in its analysis.

Figure 1 :
Figure 1: High-level overview of our MC-CP method for image classification.

Algorithm 1 :
Adaptive Monte Carlo Dropout
Input: Model f, Input x, Maximum forward passes K, Threshold δ, and Patience P
Output: Mean prediction µ, and Variance σ
1: Count ← 0
2: Predictions ← []
3: while (Count < P & size(Predictions) < K) do
4:    pred ← f(x) with dropout layers enabled
5:    append pred to Predictions
6:    σ ← Var(Predictions)
7:    if size(Predictions) > 1 then
8:        diff ← |σ_{i−1} − σ|
9:        if diff < δ for all classes/quantiles then
10:           Count ← Count + 1
11:       else
12:           Count ← 0
13: end while
14: return µ ← Mean(Predictions), σ

Algorithm 2: MC-CP for image classification
Input: Model f, Test set, Maximum forward passes K, Threshold δ, and Patience P
Output: Prediction set, and variance set
Conformal Calibration
1: Split test set: split the test set into calibration c and validation v.
2: Calibrate: perform Platt scaling on the model using c.
3: Calculate conformal score: for each image in c, define E_j = Σ_{i=1}^{k′} π_(i)(x_j).

Algorithm 3 :
MC-CP for deep quantile regression
Input: Model f, Test set, Maximum forward passes K, Threshold δ, and Patience P
Output: Prediction interval, and variance
Conformal Calibration
1: Split test set: split the test set into calibration c and validation v.
2: Calculate conformal score: for each data point in c, define E_j = max{t_{α/2}(x_j) − y_j, y_j − t_{1−α/2}(x_j)}.
3: Find the threshold: assign the quantile threshold q̂ to the 1 − α quantile of the E_j.

Figure 2 :
Figure 2: Percentage and accuracy of singleton and mixed predictions for Naive CP, RAPS, MC-CP on CIFAR-10.

Figure 4 :
Figure 4: Convergence of variance for each class during the Adaptive MC Dropout procedure.
• The hybrid MC-CP method that addresses major issues common with CP methods, yielding significant improvements across several metrics and datasets.
• A comprehensive empirical MC-CP evaluation against state-of-the-art UQ methods (MC Dropout, RAPS, CQR) on various benchmarks, including CIFAR-10, CIFAR-100, MNIST, Fashion-MNIST, and Tiny ImageNet.

Paper Structure: Sections and discuss related UQ work and background material. Sections and present MC-CP and its empirical evaluation. Section concludes the paper.

arXiv:2308.09647v2 [cs.LG], 22 Jan 2024

Table 2 :
Test errors (%) and prediction sizes per UQ method for two large DL models on the Tiny ImageNet dataset.

Table 3 :
Mean accuracy (%) of classes for each method on the CIFAR-10 dataset.

Table 4 :
Sensitivity analysis on various threshold δ and patience P combinations on the CIFAR-10 dataset (K = 1000).

Table 5 :
Mean absolute error (MAE) and empirical coverage (%) for each method on the Boston Housing, Abalone, Blog Feedback, Concrete Strength and Protein datasets.