Empirical Risk Minimization With Relative Entropy Regularization

The empirical risk minimization (ERM) problem with relative entropy regularization (ERM-RER) is investigated under the assumption that the reference measure is a $\sigma $ -finite measure, and not necessarily a probability measure. Under this assumption, which leads to a generalization of the ERM-RER problem allowing a larger degree of flexibility for incorporating prior knowledge, numerous relevant properties are stated. Among these properties, the solution to this problem, if it exists, is shown to be a unique probability measure, mutually absolutely continuous with the reference measure. Such a solution exhibits a probably-approximately-correct guarantee for the ERM problem independently of whether the latter possesses a solution. For a fixed dataset and under a specific condition, the empirical risk is shown to be a sub-Gaussian random variable when the models are sampled from the solution to the ERM-RER problem. The generalization capabilities of the solution to the ERM-RER problem (the Gibbs algorithm) are studied via the sensitivity of the expected empirical risk to deviations from such a solution towards alternative probability measures. Finally, an interesting connection between sensitivity, generalization error, and lautum information is established.


I. INTRODUCTION
I N statistical machine learning, the problem of empirical risk minimization (ERM) with relative entropy regularization (ERM-RER) has been the workhorse for building probability measures on the set of models, without any additional assumption on the statistical description of the datasets. See for instance [3]-[5] and [6]. Instead of additional statistical assumptions on the datasets, which are typical in Bayesian methods [7], relative entropy regularization requires a reference probability measure on the set of models, which is external to the ERM problem. Often, such a reference measure represents prior knowledge or side information and is chosen to guide the search of models towards those inducing low empirical risks with high probability over seen and unseen datasets. From this perspective, the reference measure can be seen as an additional degree of freedom to improve the generalization capabilities of machine learning algorithms based on ERM-RER, e.g., Gibbs algorithms [5], [8]-[17] and [18]. This new degree of freedom is one of the main motivations for regularizing the ERM problem using relative entropy, or more generally, any f-divergence regularization, as discussed in [19], [20] and [21]. Beyond probability measures, as shown in this paper, the reference measure can be any σ-finite measure with arbitrary support. The flexibility introduced by this generalization becomes particularly relevant for the case in which priors are available in the form of probability distributions that can be evaluated only up to a normalizing factor, cf. [22], or cannot be represented by probability distributions, e.g., equal preferences among the elements of a countably infinite set. For some specific choices of σ-finite reference measures, the ERM-RER problem boils down to particular cases of special interest: (i) the information-risk minimization problem presented in [23]; (ii) the ERM with differential entropy regularization (ERM-DiffER); and (iii) the ERM with discrete entropy regularization
(ERM-DisER). See for instance [24], [25] and references therein. From this perspective, the proposed ERM-RER formulation yields a unified mathematical framework that comprises a large class of problems.
When the reference measure is a probability measure, the solution to the ERM-RER problem is known to be unique and to correspond to a Gibbs probability measure. Such a Gibbs probability measure has been studied using measure-theoretic and information-theoretic notions in [9], [23], [26]-[33]; statistical physics in [3]; PAC (Probably Approximately Correct)-Bayesian learning theory in [34]-[37]; and proved to be of particular interest in classification problems in [5], [12], [20], [24], [38], [39] and [40]. In the general case in which the reference is a σ-finite measure, a solution to the ERM-RER problem does not always exist. Nonetheless, if it exists, it is shown to be a unique Gibbs probability measure, despite the fact that its partition function is defined with respect to a σ-finite measure. The condition for existence is mild and is always satisfied when the reference measure is a probability measure, as highlighted above. Interestingly, such a solution is mutually absolutely continuous with the reference measure, and most of the properties known for the classical ERM-RER problem are shown to hold in the most general case. For instance, under certain conditions, the empirical risk observed when models are sampled from the ERM-RER-optimal probability measure is a sub-Gaussian random variable that exhibits a PAC guarantee for the ERM problem without regularization.
When the solution to the ERM-RER problem is used to sample models to label unseen patterns, the process is known as the Gibbs algorithm. One of the traditional performance metrics to evaluate the generalization capabilities of machine learning algorithms is the generalization error, for which closed-form expressions in terms of information measures are presented in [15]. When the reference measure is a probability measure, a closed-form expression for the generalization error of the Gibbs algorithm is presented in [9], while upper bounds have been derived in [18], [23], [31]-[37], [41]-[55], and references therein. In this work, a new performance metric coined sensitivity, which quantifies the variations of the expected empirical risk due to deviations from the solution of the ERM-RER problem, is introduced. The sensitivity is defined as the difference between two quantities: (a) the expectation of the empirical risk with respect to the solution to the ERM-RER problem; and (b) the expectation of the empirical risk with respect to an alternative measure. The absolute value of the sensitivity is shown to be upper bounded by a term that is proportional to the square root of the relative entropy of the alternative measure with respect to the ERM-RER-optimal measure. Such a bound provides lower and upper bounds on the expected empirical risk after a deviation from the ERM-RER-optimal measure towards an alternative probability measure. More interestingly, the expectation (with respect to the probability distribution of the datasets) of the sensitivity to deviations towards a specific measure is shown to be equal to the generalization error of the Gibbs algorithm. Using this result, the closed-form expression for the generalization error of the Gibbs algorithm presented in [9] is shown to hold even in the case in which the reference measure is a σ-finite measure. Moreover, the generalization error is shown to be upper bounded by a term that is proportional to the square root of the lautum information between the models and the datasets, cf. [56]. This bound is reminiscent of the result in [33, Theorem 1], in which a similar bound is presented using the mutual information instead of the lautum information. While [33, Theorem 1] follows immediately from the variational representation of relative entropy, cf. [57, Lemma 4.18 (Transportation Lemma)], the new result follows from the fact that the empirical risk, when models are sampled from the ERM-RER-optimal probability measure, is a sub-Gaussian random variable. Interestingly, the new upper bound does not require any of the conditions in [33, Theorem 1].
The remainder of this work is organized as follows. Section II introduces two optimization problems: the ERM and the ERM-RER. The asymmetry of the relative entropy is analyzed in the context of the ERM-RER and two variants, coined Type-I and Type-II, are distinguished. The former considers the case in which the regularization is the relative entropy of the optimization measure with respect to the reference measure. The latter considers a regularization by the relative entropy of the reference measure with respect to the optimization measure. Section III presents the solution to the ERM-RER problem in the general case and introduces its main properties. Section IV introduces two new classes of reference measures, and the solution of the ERM-RER problem is shown to exhibit different properties for each class. This section ends by studying the ERM-RER problem in the special case in which the reference measure is a Gibbs probability measure. This special case exhibits a solution that is identical to the solution to an ERM-RER problem whose reference measure is the same used to build the above-mentioned Gibbs measure. Section V studies the properties of the log-partition function of the ERM-RER-optimal probability measure. The first, second, and third cumulants of the empirical risk, when the models are sampled from the ERM-RER-optimal measure and the reference measure, are respectively characterized. Section VI and Section VII study the properties of the expectation and variance of the empirical risk when the models are sampled from the ERM-RER-optimal probability measure. These are compared with the mean and variance of the empirical risk when models are sampled from the reference measure. Section VIII introduces several explicit expressions for the cumulant generating function of the empirical risk when the models are sampled from the ERM-RER-optimal measure. Using these equivalent expressions, under a specific condition, it is shown that the empirical risk is a sub-Gaussian random variable when models are sampled from the ERM-RER-optimal measure. Section IX describes the monotonic concentration of the ERM-RER-optimal probability measure when the regularization factor tends to zero. Section X shows that the empirical risk, when the models are sampled from the ERM-RER-optimal probability measure, exhibits a PAC-type guarantee with respect to the ERM problem without regularization. Finally, Section XI studies the sensitivity of the expected empirical risk with respect to deviations from the ERM-RER-optimal measure towards alternative measures and shows connections with the generalization error and the lautum information. Section XII ends this work with conclusions and a discussion of the results.

II. EMPIRICAL RISK MINIMIZATION (ERM)
Let M, X and Y, with M ⊆ R d and d ∈ N, be sets of models, patterns, and labels, respectively. A pair (x, y) ∈ X × Y is referred to as a labeled pattern or as a data point. Given n data points, with n ∈ N, denoted by (x 1 , y 1 ), (x 2 , y 2 ), ..., (x n , y n ), the corresponding dataset is represented by the tuple
z = ((x 1 , y 1 ), (x 2 , y 2 ), ..., (x n , y n )) ∈ (X × Y) n . (1)
Let the function f : M × X → Y be such that the label assigned to the pattern x according to the model θ ∈ M is f (θ, x). Let also the function
ℓ : Y × Y → [0, +∞] (2)
be such that, given a data point (x, y) ∈ X × Y, the risk induced by a model θ ∈ M is ℓ (f (θ, x), y). In the following, the risk function ℓ is assumed to be nonnegative and, for all y ∈ Y, ℓ (y, y) = 0.
The empirical risk induced by the model θ, with respect to the dataset z in (1), is determined by the function L z : M → [0, +∞], which satisfies
L z (θ) = (1/n) Σ_{i=1}^{n} ℓ (f (θ, x i ), y i ). (3)
Using this notation, the ERM consists of the following optimization problem:
min_{θ ∈ M} L z (θ). (4)
Let the set of solutions to the ERM problem in (4) be denoted by
T (z) ≜ arg min_{θ ∈ M} L z (θ). (5)
Note that if the set M is finite, the ERM problem in (4) always possesses a solution, and thus, |T (z)| > 0. Nonetheless, in general, the ERM problem might not necessarily possess a solution, i.e., |T (z)| = 0.
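As a toy numeric illustration of (1)-(5), the following sketch evaluates the empirical risk over a finite model set, in which case the ERM problem always possesses a solution. The choice of models f (θ, x) = θx, the squared loss, and all numeric values are hypothetical and not from the paper:

```python
import numpy as np

# Hypothetical toy instance: scalar linear models f(theta, x) = theta * x,
# squared loss ell(yhat, y) = (yhat - y)^2, and a finite candidate set M.
rng = np.random.default_rng(0)
x = rng.normal(size=20)                # patterns
y = 1.5 * x                            # labels generated by the model theta = 1.5
z = list(zip(x, y))                    # dataset, as in (1)

def empirical_risk(theta, z):
    """L_z(theta): average loss of model theta over the dataset z, as in (3)."""
    return np.mean([(theta * xi - yi) ** 2 for xi, yi in z])

M = np.linspace(-3.0, 3.0, 61)         # finite model set, so a minimizer exists
risks = np.array([empirical_risk(t, z) for t in M])
T_z = M[risks == risks.min()]          # set of ERM solutions T(z), as in (5)
print(T_z)                             # contains 1.5, the generating model
```

Since M is finite here, |T (z)| ⩾ 1 is guaranteed; over an uncountable M with a pathological loss, the minimum may fail to be attained.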

A. Notation and Main Assumptions
In the following, given a measurable space (Ω, F ), the notation △ (Ω, F ) is used to represent the set of σ-finite measures that can be defined over (Ω, F ). Given a measure Q ∈ △ (Ω, F ), the subset △ Q (Ω, F ) of △ (Ω, F ) contains all σ-finite measures that are absolutely continuous with respect to the measure Q. Alternatively, the subset ▽ Q (Ω, F ) of △ (Ω, F ) contains all probability measures P such that Q is absolutely continuous with respect to P . Given a set A ⊂ R d , the Borel σ-field over A is denoted by B (A).
The main assumption adopted in this work is that the function L z in (3) is measurable with respect to the Borel measurable spaces (M, B (M)) and ([0, +∞], B ([0, +∞])).

B. Relative Entropy Extended to σ-Finite Measures
In this work, the relative entropy, which is usually defined for probability measures, is extended to σ-finite measures.
Definition 1 (Generalized Relative Entropy): Given two σ-finite measures P and Q on the same measurable space, such that P is absolutely continuous with respect to Q, the relative entropy of P with respect to Q is
D (P ∥Q) = ∫ dP/dQ (x) log ( dP/dQ (x) ) dQ (x),
where the function dP/dQ is the Radon-Nikodym derivative of P with respect to Q.
The relative entropy exhibits a property often referred to as the information inequality [58, Theorem 2.6.3] in the case of probability measures on (Ω, F ), with Ω a countable set.The following theorem explores this property in a more general scenario.
Theorem 1: If P and Q are both probability measures on a general measurable space (Ω, F ), with P absolutely continuous with respect to Q, then,
D (P ∥Q) ⩾ 0,
with equality if and only if P and Q are identical.
Proof: Consider the function f : [0, ∞) → R such that for all x ∈ (0, +∞), f (x) = x log(x), and f (0) = 0. Note that f is strictly convex. If P and Q are both probability measures on the measurable space (Ω, F ), the following holds:
D (P ∥Q) = ∫ f ( dP/dQ (x) ) dQ (x) ⩾ f ( ∫ dP/dQ (x) dQ (x) ) = f (1) = 0, (11)
where the inequality in (11) follows from Jensen's inequality [59, Section 6.3.5]. Equality in (11) holds if and only if for all x ∈ supp Q, dP/dQ (x) = 1, which implies that both P and Q are identical. This completes the proof.
If Q is not a probability measure, then it might be observed that D (P ∥Q) < 0. Consider for instance the case in which P is a zero-mean Gaussian probability measure with variance σ 2 and Q is the Lebesgue measure on (R, B (R)). Hence, the Radon-Nikodym derivative dP/dQ is the Gaussian probability density function such that for all x ∈ R,
dP/dQ (x) = (1/√(2πσ 2 )) exp (−x 2 /(2σ 2 )).
Under this assumption, the relative entropy of P with respect to Q is the negative of the differential entropy of P . That is,
D (P ∥Q) = −h(P ) (14)
= −(1/2) log (2πeσ 2 ), (15)
where h(P ) denotes the differential entropy of P . A central observation from (14) is that the equality D (P ∥Q) = 0 does not necessarily imply that P and Q are identical measures. For instance, when σ 2 = 1/(2πe) in (15), it holds that D (P ∥Q) = 0, while P is a Gaussian probability measure and Q is the Lebesgue measure.
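The Gaussian-versus-Lebesgue example can be checked numerically. The sketch below (a simple quadrature, not the paper's derivation) evaluates ∫ (dP/dQ) log(dP/dQ) dx and compares it with −(1/2) log(2πeσ²); the relative entropy vanishes at σ² = 1/(2πe) and is negative for larger variances:

```python
import numpy as np

def relative_entropy_vs_lebesgue(sigma2, n=2_000_001, lim=60.0):
    """D(P||Q) for P = N(0, sigma2) and Q the Lebesgue measure on (R, B(R)):
    the integral of (dP/dQ) log(dP/dQ) with respect to the Lebesgue measure,
    approximated by a Riemann sum on a fine grid."""
    x = np.linspace(-lim, lim, n)
    pdf = np.exp(-x**2 / (2.0 * sigma2)) / np.sqrt(2.0 * np.pi * sigma2)
    integrand = np.zeros_like(pdf)
    mask = pdf > 0                      # avoid log(0) where the density underflows
    integrand[mask] = pdf[mask] * np.log(pdf[mask])
    return float(np.sum(integrand) * (x[1] - x[0]))

def closed_form(sigma2):
    """Negative differential entropy of N(0, sigma2): -(1/2) log(2 pi e sigma2)."""
    return -0.5 * np.log(2.0 * np.pi * np.e * sigma2)

for s2 in (1.0 / (2.0 * np.pi * np.e), 1.0, 4.0):
    print(s2, relative_entropy_vs_lebesgue(s2), closed_form(s2))
```

The last case (σ² = 4) returns a negative value, illustrating that relative entropy with respect to a σ-finite reference need not be nonnegative.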
The following property, known for the case of probability measures as the joint-convexity of the relative entropy, is extended by the following theorem.
Theorem 2: Let P 1 and P 2 be two probability measures and Q 1 and Q 2 be two σ-finite measures, all on the same measurable space. For all i ∈ {1, 2}, let P i be absolutely continuous with respect to Q i . Then, for all λ ∈ [0, 1],
D (λP 1 + (1 − λ)P 2 ∥λQ 1 + (1 − λ)Q 2 ) ⩽ λ D (P 1 ∥Q 1 ) + (1 − λ) D (P 2 ∥Q 2 ). (17)
Equality in (17) holds if and only if, for all λ ∈ (0, 1), the Radon-Nikodym derivatives dP 1 /dQ 1 and dP 2 /dQ 2 are identical almost surely.
Proof: The proof is presented in Appendix A.

C. ERM with Relative Entropy Regularization
Given a dataset, the expected empirical risk induced by a measure P ∈ ∆ (M, B (M)) is defined as follows.
Definition 2 (Expected Empirical Risk): Let P be a probability measure in ∆ (M, B (M)). The expected empirical risk with respect to the dataset z in (1) induced by the measure P is
R z (P ) = ∫ L z (θ) dP (θ), (18)
where the function L z is defined in (3).
The ERM-RER problem is parametrized by a σ-finite measure in △ (M, B (M)) and a positive real, which are referred to as the reference measure and the regularization factor, respectively. Let Q ∈ △ (M, B (M)) be a σ-finite measure and let λ be a positive real. The ERM-RER problem, with parameters Q and λ, consists of the following optimization problem:
min_{P ∈ △ Q (M, B (M))} R z (P ) + λ D (P ∥Q), (19)
where the dataset z is in (1), and the functional R z is defined in (18).
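On a finite model set, the objective in (19) can be evaluated directly. The following sketch (a hypothetical finite instance with arbitrary risks and a uniform reference) checks numerically that a Gibbs pmf proportional to q(θ) exp(−L z (θ)/λ), i.e., the form of solution established later in the paper, attains a lower objective value than randomly drawn alternative pmfs:

```python
import numpy as np

rng = np.random.default_rng(1)
m = 8                                   # size of the finite model set
L = rng.uniform(0.0, 2.0, size=m)       # hypothetical empirical risks L_z(theta)
q = np.full(m, 1.0 / m)                 # reference probability measure Q (uniform)
lam = 0.5                               # regularization factor lambda

def objective(p):
    """R_z(P) + lambda * D(P||Q) for a pmf p on the finite model set."""
    mask = p > 0
    return float(p @ L + lam * np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Candidate solution: Gibbs pmf proportional to q(theta) exp(-L_z(theta)/lambda)
w = q * np.exp(-L / lam)
p_gibbs = w / w.sum()

# No randomly drawn pmf achieves a lower objective than the Gibbs pmf.
worst_gap = min(objective(rng.dirichlet(np.ones(m))) - objective(p_gibbs)
                for _ in range(1000))
print(worst_gap >= 0.0)                 # True
```

This is only a numerical sanity check over random alternatives, not a proof of optimality; the formal statement is given in Section III.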

D. Type-I and Type-II Relative Entropy Regularization
The optimization problem in (19) is coined Type-I ERM-RER in [60], with the aim of distinguishing it from the optimization problem
min_{P ∈ ▽ Q (M, B (M))} R z (P ) + λ D (Q∥P ), (20)
which is coined Type-II ERM-RER.
The Type-II ERM-RER problem in (20), when Q is a probability measure, exhibits a solution that is identical to the solution to the following Type-I ERM-RER problem [60, Theorem 1]:
min_{P ∈ △ Q (M, B (M))} ∫ log (β + L z (ν)) dP (ν) + D (P ∥Q), (21a)
where β is a constant chosen to satisfy a normalization condition stated in [60, Theorem 1]. Essentially, by appropriately transforming the objective function, an equivalence can be established between the Type-I and Type-II ERM-RER problems. Hence, without loss of generality, the remainder of this work focuses exclusively on Type-I ERM-RER, which is simply referred to as ERM-RER.

III. THE SOLUTION TO THE ERM-RER PROBLEM
The solution to the ERM-RER problem in (19) is presented in terms of two objects. First, the function K Q,z : R → R ∪ {+∞} such that for all t ∈ R,
K Q,z (t) = log ( ∫ exp (t L z (θ)) dQ (θ) ), (22)
with L z in (3). Second, the set K Q,z ⊂ (0, +∞), which is defined by
K Q,z ≜ {s ∈ (0, +∞) : K Q,z (−1/s) < +∞}. (23)
The notation for the function K Q,z and the set K Q,z is chosen such that their parametrization by (or dependence on) the dataset z in (1) and the σ-finite measure Q in (19) is highlighted.
The following lemma describes the set K Q,z .
Lemma 1: If the measure Q in (19) is a probability measure, then, the set K Q,z in (23) satisfies K Q,z = (0, +∞).
Proof: The proof is presented in Appendix B.
Using this notation, the solution to the ERM-RER problem in (19) is presented by the following theorem.
Theorem 3: If λ ∈ K Q,z , with K Q,z in (23), the solution to the optimization problem in (19) is a unique probability measure, denoted by P (Q,λ) Θ|Z=z , which satisfies for all θ ∈ supp Q,
dP (Q,λ) Θ|Z=z /dQ (θ) = exp ( −K Q,z (−1/λ) − (1/λ) L z (θ) ), (25)
where the function L z is defined in (3) and the function K Q,z is defined in (22).
Proof: The proof is presented in Appendix C.
Contrary to the ERM problem in (4), which does not necessarily possess a solution, the ERM-RER problem in (19) always possesses a solution when Q is a probability measure. This is essentially because the set K Q,z is the set of all positive reals (Lemma 1), and thus, the condition λ ∈ K Q,z is always verified. On the contrary, when Q is a σ-finite measure, the solution to the ERM-RER problem in (19) depends on whether λ ∈ K Q,z . If λ ∈ K Q,z , the solution is the probability measure P (Q,λ) Θ|Z=z in (25), which is unique and corresponds to a Gibbs probability measure [61]. The function K Q,z is often referred to as the log-partition function, see for instance, [62, Section 7.3.1].
The following lemma shows that the Radon-Nikodym derivative in (25) is both nonnegative and finite.
Lemma 2: The Radon-Nikodym derivative dP (Q,λ) Θ|Z=z /dQ in (25) is both nonnegative and finite. Moreover, it is strictly positive almost surely with respect to Q.
Proof: The proof is presented in Appendix D.
Theorem 3 has shown that the probability measure P (Q,λ) Θ|Z=z is absolutely continuous with respect to the measure Q.The following lemma shows that the converse is also true.

Lemma 3:
The σ-finite measure Q and the probability measure P (Q,λ) Θ|Z=z in (25) are mutually absolutely continuous.
Proof: The proof is presented in Appendix E.
The relevance of Lemma 3 is that it shows that if λ ∈ K Q,z , the corresponding collections of negligible sets with respect to the measures P (Q,λ) Θ|Z=z and Q are identical. The following lemma shows that the negligible sets with respect to the measure P (Q,λ) Θ|Z=z in (25) are invariant with respect to λ.
Lemma 4: For all λ 1 and λ 2 in K Q,z , with K Q,z in (23), assume that the probability measures P (Q,λ 1 ) Θ|Z=z and P (Q,λ 2 ) Θ|Z=z satisfy (25). Then, the measures P (Q,λ 1 ) Θ|Z=z and P (Q,λ 2 ) Θ|Z=z are mutually absolutely continuous.
Particular assumptions on the set M and the reference measure Q lead to well-known instances of the ERM-RER problem in (19), as discussed hereunder.

A. Examples
Three examples are of particular interest: (a) the set M ⊂ R d is countable and the measure Q is the counting measure on (M, B (M)), which leads to the ERM-DisER problem; (b) the set M is an uncountable subset of R d , and Q is the Lebesgue measure on (M, B (M)), which leads to the ERM-DiffER problem; and (c) the set M and the measure Q form a Borel probability measure space (M, B (M) , Q), which leads to the information-risk minimization problem.
1) ERM with Discrete Entropy Regularization: When the set M ⊂ R d is countable and the σ-finite measure Q in (19) is the counting measure on (M, B (M)), given a probability measure P ∈ △ (M, B (M)), the Radon-Nikodym derivative dP/dQ is a probability mass function, denoted by p. Thus, the relative entropy D (P ∥Q) is equivalent to the negative of the discrete entropy induced by p [58, Chapter 2], denoted by H(p). In this case, the ERM-RER problem in (19) can be re-written as the following ERM-DisER problem:
min_{p} Σ_{θ ∈ M} p(θ) L z (θ) − λ H(p), (27)
where the optimization domain in (27) is the set of probability mass functions that can be defined over the measurable space (M, B (M)). In this special case, the probability measure P (Q,λ) Θ|Z=z in (25), whose probability mass function is the solution to the ERM-DisER problem in (27), satisfies for all θ ∈ M,
p(θ) = exp (−(1/λ) L z (θ)) / Σ_{ν ∈ M} exp (−(1/λ) L z (ν)), (28)
which describes the discrete Gibbs probability measure on (M, B (M)), with temperature parameter λ, and energy function L z in (3).
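As a minimal numeric illustration of the counting-measure case (all risk values below are arbitrary, hypothetical choices), the sketch shows that the regularizer D (P ∥Q) = −H(p) can be negative, and that the solution (28) is a softmax of −L z /λ whose mode is the empirical risk minimizer:

```python
import numpy as np

L = np.array([0.0, 0.3, 0.9, 2.0])   # hypothetical risks on a finite set M, |M| = 4
lam = 0.25                            # temperature / regularization factor

# With the counting measure as reference, dP/dQ is the pmf p itself, so
# D(P||Q) = sum_theta p(theta) log p(theta) = -H(p), which is negative here.
p = np.array([0.7, 0.1, 0.1, 0.1])
D = float(np.sum(p * np.log(p)))
print(D < 0.0)                        # True: the regularizer can be negative

# Solution (28): the discrete Gibbs pmf, i.e., a softmax of -L_z/lambda.
w = np.exp(-L / lam)
p_star = w / w.sum()
print(int(p_star.argmax()))           # 0, the index of the risk minimizer
```

The negativity of −H(p) is exactly why Theorem 1 fails for σ-finite references: the objective in (19) can dip below R z (P ).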

2) ERM with Differential Entropy Regularization:
When M ⊆ R d is uncountable and the σ-finite measure Q in (19) is the Lebesgue measure on (M, B (M)), for all probability measures P ∈ ∆ Q (M, B (M)), the Radon-Nikodym derivative dP/dQ is a probability density function, denoted by g. Thus, the relative entropy D (P ∥Q) is equivalent to the negative of the differential entropy induced by g [58, Chapter 8], denoted by h(g). In this special case, the ERM-RER problem in (19) can be re-written as the following ERM-DiffER problem:
min_{g} ∫ g(θ) L z (θ) dθ − λ h(g), (29)
where the optimization domain in (29) is the set of probability density functions that can be defined over the measure space (M, B (M)). The probability measure P (Q,λ) Θ|Z=z in (25), whose probability density function is the solution to the ERM-DiffER problem in (29), satisfies for all θ ∈ M,
g(θ) = exp (−(1/λ) L z (θ)) / ∫ exp (−(1/λ) L z (ν)) dν, (30)
which describes the absolutely continuous Gibbs probability measure with temperature parameter λ and energy function L z in (3).

B. Bounds on the Radon-Nikodym Derivative
The Radon-Nikodym derivative dP (Q,λ) Θ|Z=z /dQ in (25) is greater for models inducing smaller empirical risks, as shown by the following corollary of Theorem 3.
Corollary 1: For all θ 1 ∈ supp Q and θ 2 ∈ supp Q such that L z (θ 1 ) ⩽ L z (θ 2 ), the Radon-Nikodym derivative in (25) satisfies dP (Q,λ) Θ|Z=z /dQ (θ 1 ) ⩾ dP (Q,λ) Θ|Z=z /dQ (θ 2 ).
The intuition that follows from Corollary 1 is that, under the assumption that the ERM problem in (4) possesses a solution in the support of the reference measure, i.e., T (z) ∩ supp Q is not empty, with T (z) in (5), the maximum of the Radon-Nikodym derivative in (25) is achieved by the models in T (z) ∩ supp Q. Given that the Radon-Nikodym derivative in (25) is either the probability mass function in (28) or the probability density function in (30), Corollary 1 shows that the elements of the set T (z) ∩ supp Q are the modes of the corresponding probability density function or probability mass function.

C. Asymptotes of the Radon-Nikodym Derivative
The following lemma describes the asymptotic behavior of the Radon-Nikodym derivative dP (Q,λ) Θ|Z=z /dQ in (25) when the regularization factor increases, i.e., λ → +∞, and the reference measure Q is a probability measure.
Lemma 5: If the measure Q in (19) is a probability measure, then for all θ ∈ supp Q, the Radon-Nikodym derivative in (25) satisfies
lim_{λ → +∞} dP (Q,λ) Θ|Z=z /dQ (θ) = 1.
Proof: From Theorem 3, it follows that for all θ ∈ supp Q,
lim_{λ → +∞} dP (Q,λ) Θ|Z=z /dQ (θ) = lim_{λ → +∞} exp (−(1/λ) L z (θ)) / ∫ exp (−(1/λ) L z (ν)) dQ (ν) = 1,
where the function L z is defined in (3). This completes the proof.
Lemma 5 unveils the fact that, when Q is a probability measure, in the limit when λ → +∞, both probability measures P (Q,λ) Θ|Z=z and Q are identical. This is consistent with the fact that, when λ tends to infinity, the optimization problem in (19) boils down to exclusively minimizing the relative entropy. Such a minimum is zero and is observed when both probability measures P (Q,λ) Θ|Z=z and Q are identical (Theorem 1). Such intuition breaks when the reference measure is a σ-finite measure, but not a probability measure. In such a case, the relative entropy term in (19) might be negative and a minimum might not exist. See for instance the case of the relative entropy between a Gaussian measure and the Lebesgue measure in (14), which is negative whenever σ 2 > 1/(2πe).
The limit of the Radon-Nikodym derivative dP (Q,λ) Θ|Z=z /dQ in (25), when λ tends to zero from the right, can be studied using the following set:
L z (δ) ≜ {θ ∈ M : L z (θ) ⩽ δ}, (36)
where the function L z is defined in (3) and δ ∈ [0, +∞). In particular, consider the nonnegative real
δ ⋆ Q,z ≜ inf {δ ∈ [0, +∞) : Q (L z (δ)) > 0}. (37)
Let also L ⋆ Q,z be the following level set of the empirical risk function L z in (3):
L ⋆ Q,z ≜ L z (δ ⋆ Q,z ). (38)
Using this notation, the limit of the Radon-Nikodym derivative dP (Q,λ) Θ|Z=z /dQ in (25), when λ tends to zero from the right, is described by the following lemma.
Lemma 6: If Q (L ⋆ Q,z ) > 0, with L ⋆ Q,z in (38) and Q the σ-finite measure in (19), then for all θ ∈ supp Q, the Radon-Nikodym derivative in (25) satisfies
lim_{λ → 0 +} dP (Q,λ) Θ|Z=z /dQ (θ) = 1{θ ∈ L ⋆ Q,z } / Q (L ⋆ Q,z ).
Proof: The proof is presented in Appendix G.
Assume that Q (L ⋆ Q,z ) > 0, with L ⋆ Q,z in (38). Under this assumption, from Lemma 6, it holds that the probability measure P (Q,λ) Θ|Z=z asymptotically concentrates on the set L ⋆ Q,z when λ tends to zero from the right. More specifically, note that for all measurable sets A ⊆ L ⋆ Q,z ∩ supp Q, it holds that
lim_{λ → 0 +} P (Q,λ) Θ|Z=z (A) = lim_{λ → 0 +} ∫_A dP (Q,λ) Θ|Z=z /dQ (θ) dQ (θ)
= ∫_A lim_{λ → 0 +} dP (Q,λ) Θ|Z=z /dQ (θ) dQ (θ) (42)
= ∫_A (1/Q (L ⋆ Q,z )) dQ (θ) (43)
= Q (A) / Q (L ⋆ Q,z ), (45)
where the equality in (42) follows from Lemma 2 and the dominated convergence theorem [59, Theorem 2.6.9]. The equality in (43) follows from Lemma 6. In the particular case in which A = L ⋆ Q,z in (45), it holds that
lim_{λ → 0 +} P (Q,λ) Θ|Z=z (L ⋆ Q,z ) = 1,
which verifies the asymptotic concentration of the probability measure P (Q,λ) Θ|Z=z on the set L ⋆ Q,z . Moreover, the Radon-Nikodym derivative in (25) is a constant among the elements of the set L ⋆ Q,z . This can be assimilated to a uniform distribution of the probability among the elements of the set L ⋆ Q,z in the limit when λ tends to zero from the right, as previously highlighted in [26]-[28] and [29]. This becomes more evident in the case in which the set M is finite and Q is the counting measure. In such a case, the asymptotic probability of each of the elements in L ⋆ Q,z , when λ tends to zero from the right, is 1/|L ⋆ Q,z |.
Alternatively, assume that Q (L ⋆ Q,z ) = 0, with L ⋆ Q,z in (38). Under this assumption, in the asymptotic regime when λ → 0, the limit of the measure P (Q,λ) Θ|Z=z is not a probability measure but either the trivial measure or the infinite measure. This is typically the case in which M = R d , the measure Q is absolutely continuous with respect to the Lebesgue measure, and the ERM problem in (4) has a unique solution on the support of Q, i.e., L ⋆ Q,z = T (z) and |T (z)| = 1, which implies Q (L ⋆ Q,z ) = 0. An interesting question, which is left out of the scope of this paper, is the rate at which P (Q,λ) Θ|Z=z converges to such a limiting measure. The interested reader is referred to [26], [29], and references therein.
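The two asymptotic regimes, λ → +∞ (Lemma 5) and λ → 0⁺ (the concentration discussed above), can be observed on a small finite instance. In this hypothetical example, models 0 and 1 both attain the minimum empirical risk, so the Gibbs pmf converges to Q conditioned on that set as λ → 0⁺, while it returns to Q as λ → +∞:

```python
import numpy as np

L = np.array([0.1, 0.1, 0.5, 1.2])    # models 0 and 1 are the risk minimizers
q = np.array([0.4, 0.2, 0.2, 0.2])    # reference probability measure Q

def gibbs(lam):
    """Gibbs pmf of (25) on a finite model set: proportional to q * exp(-L/lam)."""
    w = q * np.exp(-L / lam)
    return w / w.sum()

# lambda -> 0+: mass concentrates on the minimizers, proportionally to Q,
# i.e., the limit here is (2/3, 1/3, 0, 0).
print(gibbs(1e-3))
# lambda -> +infinity: the Gibbs pmf returns to the reference Q (Lemma 5).
print(gibbs(1e6))
```

Since Q here is a (counting-type, normalized) probability measure, Q (L ⋆ Q,z ) > 0 holds trivially and the limiting measure is a genuine probability measure, unlike the Lebesgue-reference case discussed above.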
The following lemma shows that, independently of whether the set L ⋆ Q,z is negligible with respect to the measure Q, the limit when λ tends to zero from the right of the probability P (Q,λ) Θ|Z=z (L z (δ)) is one, for all δ > δ ⋆ Q,z .
Lemma 7: For all δ > δ ⋆ Q,z , with δ ⋆ Q,z in (37), the probability measure P (Q,λ) Θ|Z=z in (25) satisfies
lim_{λ → 0 +} P (Q,λ) Θ|Z=z (L z (δ)) = 1,
with L z (δ) in (36).
Proof: The proof is presented in Appendix H.
Note that if the ERM problem in (4) possesses at least one solution and such a solution is within the support of the measure Q, i.e., T (z) ∩ supp Q ̸= ∅, then, when λ tends to zero from the right, the probability measure P (Q,λ) Θ|Z=z asymptotically concentrates on the solution (or the set of solutions within the support of Q) to the ERM problem in (4). Alternatively, in the case in which L ⋆ Q,z ∩ T (z) = ∅, when λ tends to zero from the right, the probability measure P (Q,λ) Θ|Z=z asymptotically concentrates on a set that does not contain the set of solutions to the ERM problem in (4). This observation leads to the introduction of two new classes of reference measures, namely, coherent and consistent measures, in the following section.

IV. REFERENCE MEASURES
This section introduces two classes of reference measures, namely coherent and consistent measures, and discusses the special case of Gibbs reference measures.

A. Coherent and Consistent Reference Measures
A class of reference measures of particular importance to establish connections between the set of solutions to the ERM problem in (4) and the solution to the ERM-RER problem in (19) is that of coherent measures. Let ρ ⋆ ⩾ 0 be the infimum of the empirical risk L z in (3). That is,
ρ ⋆ ≜ inf_{θ ∈ M} L z (θ). (47)
Using this notation, coherent measures are defined as follows.
Definition 3 (Coherent Measure): The σ-finite measure Q in (19) is said to be coherent if, for all δ > ρ ⋆ , with ρ ⋆ in (47), it holds that Q (L z (δ)) > 0, with L z (δ) in (36).
When the reference measure Q in the ERM-RER problem in (19) is a coherent measure, it holds that for all δ > ρ ⋆ , the set L z (δ) in (36) exhibits positive probability with respect to the probability measure P (Q,λ) Θ|Z=z in (25). The following lemma highlights this observation.
Lemma 8: If the σ-finite measure Q in (19) is coherent, then for all δ > ρ ⋆ , with ρ ⋆ in (47), the probability measure P (Q,λ) Θ|Z=z in (25) satisfies P (Q,λ) Θ|Z=z (L z (δ)) > 0.
Proof: The proof is presented in Appendix I.
Under the assumption that the ERM problem in (4) possesses a solution, it holds that, for all θ ⋆ ∈ T (z), with T (z) in (5), ρ ⋆ = L z (θ ⋆ ). Hence, when the σ-finite measure Q in (19) is coherent, then
δ ⋆ Q,z = ρ ⋆ , (51)
with δ ⋆ Q,z in (37) and ρ ⋆ in (47), which implies that
T (z) ∩ supp Q ⊆ L ⋆ Q,z , (52)
with T (z) in (5) and L ⋆ Q,z in (38). This observation, together with Lemma 7, leads to the following result.
Lemma 9: Assume that the ERM problem in (4) possesses a solution. Then, the probability measure P (Q,λ) Θ|Z=z in (25) and the sets T (z) in (5) and L ⋆ Q,z in (38) satisfy
lim_{λ → 0 +} P (Q,λ) Θ|Z=z (L z (δ)) = 1, for all δ > ρ ⋆ , (53)
if and only if the σ-finite measure Q in (19) is coherent.
Proof: The proof follows by observing that, if Q is a coherent measure and the ERM problem in (4) possesses a solution, the inclusion in (52) holds. Thus, from Lemma 7, the equality in (53) holds. Alternatively, when the measure Q is not coherent, it holds that δ ⋆ Q,z > ρ ⋆ . Hence, for all δ ∈ (ρ ⋆ , δ ⋆ Q,z ), it holds that Q (L z (δ)) = 0, and thus, P (Q,λ) Θ|Z=z (L z (δ)) = 0, which implies that the equality in (53) does not hold. This completes the proof.
The relevance of coherent measures in ERM-RER problems is well highlighted by Lemma 9. Essentially, when the ERM problem in (4) possesses at least one solution, the concentration of the probability measure P (Q,λ) Θ|Z=z in (25) on the set (or a subset) of solutions to the ERM problem in (4) occurs asymptotically when λ tends to zero from the right, if and only if the reference measure Q in (19) is coherent. Nonetheless, such asymptotic concentration is not a guarantee that, for strictly positive values of λ in (19), the set T (z) in (5) exhibits positive probability, i.e., P (Q,λ) Θ|Z=z (T (z)) > 0. In order to ensure this, another class of reference measures, known as consistent measures, is introduced.
Definition 4 (Consistent Measure): The σ-finite measure Q in (19) is said to be consistent if Q (L ⋆ Q,z ) > 0, with L ⋆ Q,z in (38).
Note that every consistent measure is not necessarily coherent. For instance, if Q is consistent but δ ⋆ Q,z > ρ ⋆ , with ρ ⋆ in (47) and δ ⋆ Q,z in (37), then, for all δ ∈ (ρ ⋆ , δ ⋆ Q,z ), it follows that Q (L z (δ)) = 0, and thus, Q is not coherent. Alternatively, every coherent measure is not necessarily consistent. For instance, if |L ⋆ Q,z | = 1 and Q is coherent and absolutely continuous with respect to the Lebesgue measure, it follows that Q (L ⋆ Q,z ) = 0, and thus, Q is not consistent. The relevance of consistent measures is highlighted by the following lemma.
Lemma 10: The probability measure P (Q,λ) Θ|Z=z in (25) and the set L ⋆ Q,z in (38) satisfy
P (Q,λ) Θ|Z=z (L ⋆ Q,z ) > 0,
if and only if the σ-finite measure Q in (19) is consistent.
Proof: If the measure Q is not consistent, then Q (L ⋆ Q,z ) = 0, and thus, from the fact that the measure P (Q,λ) Θ|Z=z in (25) is absolutely continuous with respect to Q, it holds that P (Q,λ) Θ|Z=z (L ⋆ Q,z ) = 0. Alternatively, if the measure Q is consistent, then Q (L ⋆ Q,z ) > 0. Moreover, for all θ ∈ L ⋆ Q,z , it holds that L z (θ) < +∞, and thus, from Lemma 2, it follows that dP (Q,λ) Θ|Z=z /dQ (θ) > 0, which implies that P (Q,λ) Θ|Z=z (L ⋆ Q,z ) > 0. This completes the proof.
The following lemma highlights a central property of consistent measures when the ERM problem in (4) possesses a solution.
Lemma 11: Assume that the ERM problem in (4) possesses a solution in the support of Q. The probability measure P (Q,λ) Θ|Z=z in (25) and the sets T (z) in (5) and L ⋆ Q,z in (38) satisfy
P (Q,λ) Θ|Z=z (T (z)) > 0,
if and only if the σ-finite measure Q in (19) is consistent.
Proof: The proof follows from Lemma 10 by noticing that, when the ERM problem in (4) possesses a solution in the support of Q, the inclusion in (52) holds.
The distinction between coherent and consistent measures becomes more evident under certain conditions. Consider the case in which M is finite. In this case, if the solution to the ERM problem in (4) is in the support of the σ-finite measure Q, then Q is both coherent and consistent. This is essentially because all measurable singletons (models) in supp Q exhibit positive measure with respect to Q. Alternatively, if the solution to the ERM problem in (4) is not in the support of Q, then Q is consistent but not coherent. Consider now the case in which M is the set R d ; the loss function ℓ in (2) is continuous; and the ERM problem in (4) admits a unique solution. In this case, any probability measure Q absolutely continuous with respect to the Lebesgue measure is a coherent measure, but it is not a consistent measure. Alternatively, if the set of solutions to the ERM problem in (4) exhibits positive Lebesgue measure, then the measure Q is both coherent and consistent.

B. Gibbs Reference Measures
In model selection, a natural idea is to proceed by successive approximations in search of lower computational complexity.
From this perspective, one might wonder whether the solution to a current instance of an ERM-RER problem might serve as the reference measure for the next instance. In this section, it is shown that this yields no benefit: composing two successive ERM-RER problems boils down to a unique ERM-RER problem with the initial reference measure and a particular regularization factor. Under the assumption that λ ∈ K Q,z , with K Q,z in (23), the problem of interest is:
min_{P ∈ △ P (Q,λ) Θ|Z=z (M, B (M))} R z (P ) + α D ( P ∥P (Q,λ) Θ|Z=z ), (59)
where α > 0; the reference measure P (Q,λ) Θ|Z=z , which satisfies (25), is the solution to the ERM-RER problem in (19); and the functional R z is defined in (18). From Theorem 3, the solution to the ERM-RER problem in (59), which is denoted by P (P (Q,λ) Θ|Z=z ,α) Θ|Z=z , satisfies for all θ ∈ supp Q,
dP (P (Q,λ) Θ|Z=z ,α) Θ|Z=z /dP (Q,λ) Θ|Z=z (θ) = exp ( −K P (Q,λ) Θ|Z=z ,z (−1/α) − (1/α) L z (θ) ),
where for all t ∈ R,
K P (Q,λ) Θ|Z=z ,z (t) ≜ log ( ∫ exp (t L z (θ)) dP (Q,λ) Θ|Z=z (θ) ). (60)
The log-partition functions K Q,z in (22) and K P (Q,λ) Θ|Z=z ,z in (60) are strongly related, as shown by the following lemma.
Lemma 12: The functions K_{Q,z} in (22) and K_{P^{(Q,λ)}_{Θ|Z=z},z} in (60) satisfy (61) for all t ∈ R. Moreover, for all t ⩽ 0, the inequality in (62) holds. Proof: The proof of (61) relies on the fact that, for all t ∈ R, the function K_{P^{(Q,λ)}_{Θ|Z=z},z} in (60) satisfies a chain of equalities ending at (65), where the equality in (65) follows from (25). Moreover, from Lemma 15, it follows that the function K_{P^{(Q,λ)}_{Θ|Z=z},z} is continuous and nondecreasing. Let s^⋆ ∈ R ∪ {+∞} denote the supremum of the set {t ∈ R : K_{P^{(Q,λ)}_{Θ|Z=z},z}(t) < +∞}. If s^⋆ = +∞, then for all t ∈ R, K_{P^{(Q,λ)}_{Θ|Z=z},z}(t) < +∞, and the proof of (61) is completed. Alternatively, if s^⋆ < +∞, it follows that for all t > s^⋆, both sides of (61) are infinite (due to the choice of λ). Hence, in this case, the equality in (61) is of the form +∞ = +∞. This completes the proof of (61).
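The chain of equalities behind (61) can be sketched from the exponential form of the solution in (25); the derivation below is a reconstruction that assumes the convention K_{Q,z}(t) = log ∫ exp(t L_z(θ)) dQ(θ), consistent with (22):

```latex
\begin{aligned}
K_{P^{(Q,\lambda)}_{\Theta|Z=z},\,z}(t)
&= \log \int \exp\bigl(t\,\mathsf{L}_{z}(\theta)\bigr)\,
   \mathrm{d}P^{(Q,\lambda)}_{\Theta|Z=z}(\theta)\\
&= \log \int \exp\Bigl(\Bigl(t-\tfrac{1}{\lambda}\Bigr)\mathsf{L}_{z}(\theta)
   - K_{Q,z}\Bigl(-\tfrac{1}{\lambda}\Bigr)\Bigr)\,\mathrm{d}Q(\theta)\\
&= K_{Q,z}\Bigl(t-\tfrac{1}{\lambda}\Bigr)
   - K_{Q,z}\Bigl(-\tfrac{1}{\lambda}\Bigr).
\end{aligned}
```

Under this reading, both claims of Lemma 12 are natural: finiteness transfers from K_{Q,z}, and for t ⩽ 0 the monotonicity of K_{Q,z} yields the bound in (62).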
The proof of (62) follows by noticing that for all t ⩽ 0 and for all θ ∈ supp Q, it holds that exp(t L_z(θ)) ⩽ 1. Hence, the inequality in (62) holds, which completes the proof.
The following lemma establishes that the solution to the ERM-RER problem in (59) is identical to the solution to another ERM-RER problem of the form in (19), with a regularization factor λ ∈ K_{Q,z}, with K_{Q,z} in (23). The formal statement is as follows.
Proof: For all θ ∈ supp Q, a chain of equalities ending at (77) holds, where the equality in (74) follows from the fact that the solution to the ERM-RER problem in (59) is absolutely continuous with respect to P^{(Q,λ)}_{Θ|Z=z}, and P^{(Q,λ)}_{Θ|Z=z} is absolutely continuous with respect to the measure Q; the equality in (75) follows from Lemma 12; and the equality in (77) follows from Theorem 3. For all measurable subsets A of M, the equalities ending at (79) hold, where the equality in (79) follows from (77). This completes the proof.
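On a finite model set, the composition property established above can be checked numerically. The harmonic-mean form of the effective regularization factor used below is an assumption (the exact value is given by an equation omitted in this excerpt), consistent with the exponential form of the solution in (25); all numerical values are hypothetical.

```python
import numpy as np

def log_partition(logq, losses, t):
    # K_{Q,z}(t) = log sum_i q_i exp(t * L_z(theta_i)) on a finite model set
    return np.log(np.sum(np.exp(logq + t * losses)))

def gibbs(logq, losses, lam):
    # Solution (25): dP/dQ = exp(-L_z/lam - K_{Q,z}(-1/lam)); returns log-probs
    return logq - losses / lam - log_partition(logq, losses, -1.0 / lam)

rng = np.random.default_rng(0)
losses = rng.uniform(0.0, 2.0, size=6)      # hypothetical empirical risks
logq = np.log(np.full(6, 1.0 / 6.0))        # uniform reference measure

lam, alpha = 0.7, 0.3
step1 = gibbs(logq, losses, lam)            # reference for the second instance
step2 = gibbs(step1, losses, alpha)         # ERM-RER with reference P^{(Q,lam)}
lam_eff = 1.0 / (1.0 / lam + 1.0 / alpha)   # conjectured effective factor
direct = gibbs(logq, losses, lam_eff)       # single ERM-RER with reference Q
print(np.max(np.abs(np.exp(step2) - np.exp(direct))))
```

The two measures agree to machine precision, illustrating why iterating the construction yields no additional flexibility beyond a change of regularization factor.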
The following theorem establishes a relation between the solutions to the optimization problems in (82) and (83), with c > 0 and ω ∈ K_{Q,z}, with K_{Q,z} in (23), two constants; P^{(Q,λ)}_{Θ|Z=z} the probability measure in (25); and R_z the functional in (18).
From Theorem 3, the solution to the ERM-RER problem in (83) is the probability measure in (84), where the function K_{Q,z} is in (22).
The following theorem formalizes the relation between both optimization problems.
Theorem 4: Assume that c and ω in (82) and (83) satisfy the condition in (85), with P^{(Q,λ)}_{Θ|Z=z} and the measure in (84) being the probability measures in (25) and (84), respectively. Then, the solution to the optimization problem in (82) is the probability measure in (84). Proof: The proof is presented in Appendix J.

V. THE LOG-PARTITION FUNCTION
This section introduces some properties of the log-partition function K_{Q,z} in (22) using the notion of separable empirical risk functions.

A. Separable Empirical Risk Functions
Separable empirical risk functions are defined with respect to a measure P ∈ △ (M).

Definition 5 (Separable Empirical Risk Function):
The empirical risk function L_z in (3) is said to be separable with respect to a σ-finite measure P ∈ △(M) if there exist a positive real c > 0 and two subsets A and B of M that are nonnegligible with respect to P, such that for all (θ_1, θ_2) ∈ A × B, the inequality in (86) holds. In a nutshell, an empirical risk function that is nonseparable with respect to the measure Q is constant almost surely. More specifically, there exists a real a ⩾ 0 such that (87) holds. From this perspective, nonseparable empirical risk functions exhibit little practical interest for model selection.
The definition of separability in Definition 5 and Lemma 3 lead to the following lemma.
Lemma 14: The empirical risk function L_z in (3) is separable with respect to the σ-finite measure Q in (19) if and only if it is separable with respect to the probability measure P^{(Q,λ)}_{Θ|Z=z} in (25).
Proof: Consider first that the function L_z is separable with respect to the σ-finite measure Q. Hence, there exist a positive real c > 0 and two subsets A and B of M that are nonnegligible with respect to Q, such that for all (θ_1, θ_2) ∈ A × B the inequality in (86) holds. Hence, from (86), the inequalities leading to (90) hold. Using the inequality in (90) and the facts that Q(A) > 0 and Q(B) > 0, it follows that P^{(Q,λ)}_{Θ|Z=z}(A) > 0 and P^{(Q,λ)}_{Θ|Z=z}(B) > 0, which implies that the function L_z is separable with respect to the probability measure P^{(Q,λ)}_{Θ|Z=z}. Consider now that the function L_z is separable with respect to the probability measure P^{(Q,λ)}_{Θ|Z=z}. Hence, there exist a positive real c > 0 and two subsets A and B of M that are nonnegligible with respect to P^{(Q,λ)}_{Θ|Z=z}, such that for all (θ_1, θ_2) ∈ A × B the inequality in (86) holds; more specifically, P^{(Q,λ)}_{Θ|Z=z}(A) > 0 and P^{(Q,λ)}_{Θ|Z=z}(B) > 0. From Lemma 2, it follows that Q(A) > 0 and Q(B) > 0, which implies that the function L_z is separable with respect to the σ-finite measure Q. This completes the proof.
Lemma 14 shows that separable empirical risk functions, and only these functions, lead to ERM-RER-optimal probability measures from which models are sampled with different probabilities. For the case of nonseparable empirical risk functions, all models are sampled from the ERM-RER-optimal probability measure with the same probability.
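A minimal numerical sketch of this dichotomy on a hypothetical finite model set: when the empirical risk is constant (nonseparable), the solution in (25) coincides with the reference probability measure for every λ, so the regularization has no selective effect, whereas a separable risk tilts the measure.

```python
import numpy as np

def gibbs_probs(q, losses, lam):
    # Solution (25) on a finite model set: p_i ∝ q_i * exp(-L_z(theta_i)/lam)
    w = q * np.exp(-losses / lam)
    return w / w.sum()

q = np.array([0.1, 0.2, 0.3, 0.4])       # reference probability measure (hypothetical)
const = np.full(4, 1.5)                  # nonseparable: L_z constant almost surely
sep = np.array([0.0, 1.0, 1.0, 2.0])     # separable empirical risk (hypothetical)

p_const = gibbs_probs(q, const, lam=0.5)
p_sep = gibbs_probs(q, sep, lam=0.5)
print(p_const)   # identical to q: the regularization has no selective effect
```

The constant-risk case returns exactly the reference measure, while the separable case reweights models according to their empirical risks.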

B. Properties of the Log-Partition Function
The log-partition function K_{Q,z} in (22) is a nondecreasing, continuous, and convex function, as shown by the following lemmas.
Lemma 15: The function K_{Q,z} in (22) is nondecreasing and differentiable infinitely many times in the interior of the set {t ∈ R : K_{Q,z}(t) < +∞}.
Proof: The proof is presented in Appendix K.
Lemma 16: The function K_{Q,z} in (22) is convex. Moreover, it is strictly convex if and only if the empirical risk function L_z in (3) is separable with respect to the σ-finite measure Q in (19).
Proof: The proof is presented in Appendix L.
In Lemma 15, it has been established that the log-partition function K_{Q,z} in (22) is differentiable infinitely many times in the interior of the set {t ∈ R : K_{Q,z}(t) < +∞}. The following lemma provides explicit expressions for the first, second, and third derivatives of the function K_{Q,z} in (22).
Lemma 17: The first, second, and third derivatives of the function K_{Q,z} in (22), denoted respectively by K^{(1)}_{Q,z}, K^{(2)}_{Q,z}, and K^{(3)}_{Q,z}, satisfy (96)–(98), where the function L_z is defined in (3) and the measure P^{(Q,λ)}_{Θ|Z=z} satisfies (25).
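The explicit expressions of Lemma 17 are not legible in this excerpt. A standard cumulant-generating-function identity consistent with (22) and (25) is that K^{(1)}_{Q,z}(−1/λ) and K^{(2)}_{Q,z}(−1/λ) equal the mean and the variance of the empirical risk when models are sampled from P^{(Q,λ)}_{Θ|Z=z}; this can be checked by finite differences on a toy finite model set (all values hypothetical).

```python
import numpy as np

def K(q, losses, t):
    # Log-partition function: K_{Q,z}(t) = log E_Q[exp(t * L_z)]
    return np.log(np.sum(q * np.exp(t * losses)))

q = np.full(4, 0.25)                      # uniform reference probability measure
losses = np.array([0.0, 0.5, 1.0, 2.0])   # hypothetical empirical risks
lam = 0.8
t0 = -1.0 / lam

# Mean and variance of the empirical risk under the Gibbs measure (25)
p = q * np.exp(t0 * losses)
p /= p.sum()
mean = np.sum(p * losses)
var = np.sum(p * (losses - mean) ** 2)

# Central finite differences of K at t0 approximate K^{(1)} and K^{(2)}
h = 1e-5
k1 = (K(q, losses, t0 + h) - K(q, losses, t0 - h)) / (2 * h)
k2 = (K(q, losses, t0 + h) - 2 * K(q, losses, t0) + K(q, losses, t0 - h)) / h**2
print(k1 - mean, k2 - var)
```

Both differences vanish up to discretization error, matching the cumulant interpretation of the derivatives.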
Note that if there exists a δ > 0 such that the log-partition function K_{Q,z} in (22) is finite within (−δ, δ), then K_{Q,z} is the cumulant generating function of the random variable L_z(θ), with θ sampled from Q, whenever Q is a probability measure. The following lemma leverages this observation.
Lemma 18: Assume that Q in (19) is a probability measure and that there exists a real δ > 0 such that the log-partition function K_{Q,z} in (22) is differentiable within (−δ, δ). Then, the first, second, and third derivatives of K_{Q,z}, denoted respectively by K^{(1)}_{Q,z}, K^{(2)}_{Q,z}, and K^{(3)}_{Q,z}, evaluated at zero, yield the mean, the variance, and the third central moment of the empirical risk when models are sampled from Q, where the function L_z is defined in (3).
Proof: The proof follows along the same lines as the proof of Lemma 17.

VI. EXPECTATION OF THE EMPIRICAL RISK
The mean of the random variable W in (99) is the expectation of the empirical risk function L_z with respect to the probability measure P^{(Q,λ)}_{Θ|Z=z} in (25), that is, R_z(P^{(Q,λ)}_{Θ|Z=z}), with the functional R_z in (18). The value R_z(P^{(Q,λ)}_{Θ|Z=z}) is referred to as the ERM-RER-optimal expected empirical risk to emphasize that it is the expected value of the empirical risk when models are sampled from the solution of the ERM-RER problem in (19). The following corollary of Lemma 17 formalizes this observation.
Corollary 2: The probability measure P^{(Q,λ)}_{Θ|Z=z} in (25) satisfies (104), where the functional R_z and the function K^{(1)}_{Q,z} are defined in (18) and (96), respectively.
The expected empirical risk R_z(P^{(Q,λ)}_{Θ|Z=z}) in (104) exhibits the following property.
where δ^⋆_{Q,z} is defined in (37). Moreover, the inequality in (105) is strict if and only if the function L_z in (3) is separable with respect to the measure Q in (19).
Proof: The proof is presented in Appendix O.
In the asymptotic regime when λ tends to zero, the expected empirical risk R_z(P^{(Q,λ)}_{Θ|Z=z}) converges, as shown by the following lemma.
where δ^⋆_{Q,z} is defined in (37). Proof: The proof is presented in Appendix P.
The following lemma determines the value of the objective function of the ERM-RER problem in (19) when it is evaluated at its solution.This result appeared first in [11, Lemma 3].
Lemma 20 (Lemma 3 in [11]): The probability measure P^{(Q,λ)}_{Θ|Z=z} in (25) and the σ-finite measure Q in (19) satisfy (107) and (108), where the functional R_z is defined in (18) and the function K_{Q,z} is defined in (22).
where the function L_z is defined in (3). Thus, the equality in (107) follows, where the functional R_z is defined in (18). This completes the proof of (107).
From Lemma 3 and (109), it follows that the equality in (108) holds, which completes the proof of (108).
The following corollary of Lemma 20 characterizes the difference between the expected values of the random variables W and V in (99) and (100), respectively.
Corollary 3: If the measure Q in (19) and the probability measure P^{(Q,λ)}_{Θ|Z=z} in (25) are both probability measures, then the equality in (116) holds. The right-hand side of (116) is a symmetrized Kullback-Leibler divergence, also known as Jeffrey's divergence [69], between the measures Q and P^{(Q,λ)}_{Θ|Z=z}. This divergence is nonnegative, which leads to the following corollary of Lemma 20. Corollary 4: If the measure Q in (19) is a probability measure, then the probability measure P^{(Q,λ)}_{Θ|Z=z} satisfies R_z(P^{(Q,λ)}_{Θ|Z=z}) ⩽ R_z(Q), where the functional R_z is defined in (18).

VII. VARIANCE OF THE EMPIRICAL RISK
In Lemma 15, it has been established that if there exists a δ > 0 such that the log-partition function K_{Q,z} in (22) is finite within the open interval (−δ, δ), then the log-partition function K_{Q,z} is differentiable infinitely many times within the interval (−∞, δ). This, together with the mean value theorem [70, Theorem 5.10], leads to the following characterization of the difference between values of K^{(2)}_{Q,z}. Lemma 21: If the measure Q in (19) is a probability measure and there exists a δ > 0 such that the function K_{Q,z} in (22) is differentiable within the open interval (−δ, δ), then for all t > 0, K^{(2)}_{Q,z}(0) − K^{(2)}_{Q,z}(−1/t) = (1/t) K^{(3)}_{Q,z}(−1/β), for some β ∈ (t, +∞), where the functions K^{(2)}_{Q,z} and K^{(3)}_{Q,z} are defined in (95).
Proof: The proof is an immediate consequence of Lemma 15 and the mean value theorem [70, Theorem 5.10].
The relevance of Lemma 21 lies in the fact that K^{(2)}_{Q,z}(−1/λ) and K^{(2)}_{Q,z}(0) are the variances of the random variables W in (99) and V in (100), respectively; see Lemma 17 and Lemma 18. Under the assumptions of Lemma 21, the function K^{(3)}_{Q,z} is continuous within (−∞, δ), where δ > 0. Hence, for all t > 0, the function K^{(3)}_{Q,z} achieves a maximum and a minimum within the interval [−1/t, 0]. Such extrema allow providing lower and upper bounds on the variance K^{(2)}_{Q,z}(0) of the random variable V, as shown hereunder.
Corollary 5: If the measure Q in (19) is a probability measure and there exists a δ > 0 such that the function K_{Q,z} in (22) is differentiable within the open interval (−δ, δ), then for all t > 0, the inequalities in (119) hold, with the extrema of K^{(3)}_{Q,z} in (120), and the functions K^{(2)}_{Q,z} and K^{(3)}_{Q,z} defined in (95).
The inequality in (119) reveals that, under the assumptions of Corollary 5, in the asymptotic regime when t → +∞, the variances of the random variables W in (99) and V in (100) are identical. Additionally, unlike the means K^{(1)}_{Q,z}(−1/λ) and K^{(1)}_{Q,z}(0) of the random variables W and V, the variances cannot be ordered in general: the ordering depends on the behavior of the function K^{(3)}_{Q,z}. Lemma 21 also shows that the monotonicity of the expectation of the random variable W in (99) with respect to λ, stated by Theorem 5, is not a property exhibited by the variance nor by the third cumulant. The following example highlights this observation.
Example 1: Consider the ERM-RER problem in (19), under the assumption that Q is a probability measure and the empirical risk function L_z in (3) is such that, for all θ ∈ M, it takes one value on a set A ⊂ M and another on M \ A, where the sets A and M \ A are nonnegligible with respect to the reference probability measure Q. In this case, the function K_{Q,z} in (22) satisfies (123) for all λ > 0. The derivatives K^{(1)}_{Q,z}, K^{(2)}_{Q,z}, and K^{(3)}_{Q,z} in (95) of the function K_{Q,z} in (123) satisfy, for all λ > 0, the expressions in (124), (125), and (126). Assume that Q(A) ⩾ 1/2. Thus, for all λ > 0, the inequality in (127) is always satisfied. Hence, for all decreasing sequences of positive reals λ_1 > λ_2 > ... > 0, the variance ordering in (129) holds. Alternatively, assume that Q(A) < 1/2. In this case, the inequality in (127) is satisfied if and only if the condition in (128) holds. Hence, if Q(A) < 1/2, then for all decreasing sequences of positive reals λ_1 > λ_2 > ... > 0, the orderings in (130) and (131) hold. Moreover, for all decreasing sequences of positive reals, the ordering in (132) holds. The upper bound by 1/4 in (129), (131), and (132) follows by noticing that the variance term is of the form p(1 − p) ⩽ 1/4. Example 1 provides important insights on the choice of the reference measure Q. Note for instance that, when the reference measure assigns a probability to the set of models T(z) in (5) that is greater than or equal to the probability of the suboptimal models M \ T(z), i.e., Q(T(z)) ⩾ 1/2, the variance strictly decreases to zero when λ decreases; see for instance Figure 1 and Figure 2. That is, when the reference measure assigns higher probability to the set of solutions to the ERM problem in (4), the variance is monotone with respect to the parameter λ.
Alternatively, when the reference measure assigns a probability to the set T(z) that is smaller than the probability of the set M \ T(z), i.e., Q(T(z)) < 1/2, there exists a critical point for λ; see for instance Figure 3. More importantly, such a critical point can be arbitrarily close to zero depending on the value Q(A). The variance strictly decreases when λ decreases beyond the critical point, whereas decreasing λ towards the critical point increases the variance.
In general, these observations suggest that reference measures Q that allocate small measures to the sets containing the set T(z) might require reducing the value of λ beyond a small threshold in order to observe small values of K^{(2)}_{Q,z}(−1/λ), which is the variance of the random variable W in (99). These observations are central to understanding the concentration of probability that occurs when λ decreases to zero, as discussed in Section IX.
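The exact form of the loss in Example 1 is not legible in this excerpt; the sketch below assumes the binary form L_z = 0 on A and L_z = 1 on M \ A, which is consistent with the 1/4 bound on the variance, and reproduces the claimed dichotomy in Q(A).

```python
import numpy as np

def var_W(qA, lam):
    # Assumed binary risk: L_z = 0 on A, 1 on M \ A (hypothetical form).
    # Gibbs probability of M \ A and resulting Bernoulli variance of W.
    p1 = (1.0 - qA) * np.exp(-1.0 / lam)
    p = p1 / (qA + p1)
    return p * (1.0 - p)

lams = np.linspace(0.05, 5.0, 200)
v_high = [var_W(0.7, l) for l in lams]   # Q(A) >= 1/2: monotone in lambda
v_low = [var_W(0.1, l) for l in lams]    # Q(A) < 1/2: interior maximum <= 1/4
print(max(v_low))
```

With Q(A) ⩾ 1/2 the variance decreases monotonically as λ decreases, while with Q(A) < 1/2 it rises to a peak near 1/4 at an interior critical value of λ before collapsing to zero.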
(Figures 1–3: mean K^{(1)}_{Q,z}(−1/λ), variance K^{(2)}_{Q,z}(−1/λ), and third central moment K^{(3)}_{Q,z}(−1/λ) as functions of λ, where the term L_z^{−1}(A) represents the set of models inducing the corresponding value of the empirical risk.)

VIII. CUMULANT GENERATING FUNCTION OF THE EMPIRICAL RISK

Note that the random variable W in (99) induces a probability measure on the reals. The objective of this section is to study the properties of the cumulant generating function J_{z,Q,λ} in (135) of such a probability measure, where the equality in (136) follows from [59, Theorem 1.6.12].
The following lemma provides an expression for J_{z,Q,λ} in terms of the log-partition function K_{Q,z} in (22). Lemma 22: If λ ∈ K_{Q,z}, with K_{Q,z} in (23), then the function J_{z,Q,λ} in (135) verifies, for all t ∈ R, the equalities in (137), (138), and (139), with the function K_{Q,z} in (22) and the function K^{(1)}_{Q,z} in (95). Proof: The proof of (137) follows immediately from (22) and (136). The proof of (138) follows from Lemma 12. Finally, the proof of (139) follows by observing that a Taylor expansion of the function K_{Q,z} in (22) at the point −1/λ yields the desired equality wherever the expansion is valid. Let s^⋆ ∈ R ∪ {+∞} be the supremum of the set {t ∈ R : K_{Q,z}(t) < +∞}. Beyond s^⋆, both sides of (139) are infinite. Hence, in this case, the equality in (139) is of the form +∞ = +∞. This completes the proof.
Proof: The proof of (149) follows from (107) in Lemma 20 by observing that for all t ∈ (0, +∞), the equalities ending at (153) hold, where the equality in (153) follows from Lemma 13. The proof of (150) follows from (108) in Lemma 20 by observing that for all t ∈ (0, +∞), the equalities ending at (155) hold, where the equality in (155) follows from Lemma 13. This completes the proof.
From Lemma 15 and Lemma 22, it follows that the function J_{z,Q,λ} in (135) is increasing and differentiable infinitely many times in the interior of the set {t ∈ R : J_{z,Q,λ}(t) < +∞}. From Lemma 22, it follows that for all m ∈ N and for all α ∈ R, the equality in (157) holds, where the function K^{(m)}_{Q,z} denotes the m-th derivative of the function K_{Q,z} in (22); see for instance Lemma 17. The equality in (157) establishes a relation between the cumulant generating function J_{z,Q,λ} and the function K_{Q,z}. This observation provides an alternative proof of Lemma 17.
The following theorem presents the relation between the cumulant generating function J_{z,Q,λ} and the functions K^{(1)}_{Q,z} and K^{(2)}_{Q,z} in (96) and (97). Theorem 7: For all α ∈ R, the function J_{z,Q,λ} in (135) verifies the equality in (158), with ξ_α in (159), where the functions K^{(1)}_{Q,z} and K^{(2)}_{Q,z} are defined in (96) and (97), respectively.
Proof: From Lemma 15, it follows that the function K_{Q,z} is differentiable infinitely many times in the interior of the set {t ∈ R : K_{Q,z}(t) < +∞}. Then, a Taylor expansion of the function K_{Q,z} in (22) at the point −1/λ yields, for all t in the interior of that set, the equality in (158), where the term ξ_α is given by (159). The proof is completed by noticing that, from Lemma 22, the equality in (138) holds. In (158), the term ξ_α depends on α via (159). The focus is now on the term K^{(2)}_{Q,z}(ξ_α). Theorem 8: The function J_{z,Q,λ} in (135) verifies the inequality in (162) for all α ∈ {t ∈ R : J_{z,Q,λ}(t) < +∞}, where β_{Q,z} is defined in (163), b is defined in (164), and the functions K^{(1)}_{Q,z} and K^{(2)}_{Q,z} are defined in (96) and (97), respectively.
Proof: From Lemma 22, it holds that J_{z,Q,λ}(t) = K_{Q,z}(t − 1/λ) − K_{Q,z}(−1/λ), which implies that the set {t ∈ R : J_{z,Q,λ}(t) < +∞} contains an interval of the form (−∞, b) and is contained within an interval of the form (−∞, b], with b in (164). This follows from the fact that the function K_{Q,z} is continuous and nondecreasing (Lemma 15). Note also that K_{Q,z}(0) = 0. Considering the asymptotic regimes α → −∞ and α → b yields, for all α ∈ (−∞, b), the inequality in (162), where the function L_z is defined by (3); the functional R_z is defined by (18); and the probability measure P^{(Q,λ)}_{Θ|Z=z} is in (25).

IX. ASYMPTOTIC CONCENTRATION OF PROBABILITY

This section introduces two results. First, in Theorem 9, it is shown that, when λ tends to zero, the sets N_{Q,z}(λ) form an indexed family that monotonically decreases to the set N^⋆_{Q,z} in (167), where δ^⋆_{Q,z} is defined in (37) and the set L_z(δ^⋆_{Q,z}) is defined in (36). Second, in Theorem 10, it is shown that the probability P^{(Q,λ)}_{Θ|Z=z}(N_{Q,z}(λ)) strictly increases when λ tends to zero. More importantly, in Theorem 11, it is shown that the limit of the probability P^{(Q,λ)}_{Θ|Z=z}(N_{Q,z}(λ)), when λ → 0, is equal to one. These observations justify referring to the set N^⋆_{Q,z} as the limit set, and they are complementary to those stated in Section III-B and Section III-C. This section ends by showing that the probability measure P^{(Q,λ)}_{Θ|Z=z} asymptotically concentrates on the set L^⋆_{Q,z}. In light of this observation, the set L^⋆_{Q,z} is referred to as the nonnegligible limit set. Finally, it is shown that when the σ-finite measure Q in (19) is coherent, the sets N^⋆_{Q,z} and L^⋆_{Q,z} are identical.

A. The Limit Set
The set N_{Q,z}(λ) in (166), with λ ∈ K_{Q,z} and K_{Q,z} in (23), contains all the models that induce an empirical risk smaller than or equal to R_z(P^{(Q,λ)}_{Θ|Z=z}), i.e., the ERM-RER-optimal expected empirical risk in (104). This observation unveils the existence of a relation between the set N^⋆_{Q,z} in (167) and the set T(z) in (5), as shown by the following lemma.
The inclusion T(z) ⊆ N^⋆_{Q,z} holds, where the set T(z) is in (5). Moreover, the equality in (169) holds if and only if (a) the ERM problem in (4) possesses a solution; and (b) the reference measure Q in (19) is coherent.
The proof of the equality in (169) is presented in two parts. In the first part, it is proved that if (169) holds, then the ERM problem in (4) possesses a solution and the measure Q is coherent. The second part proves the converse. The proof of the first part is as follows. Under the assumption that (169) holds, it follows that δ^⋆_{Q,z} = ρ^⋆, with ρ^⋆ in (47), which implies that the ERM problem in (4) possesses a solution.
Moreover, for all δ ∈ (ρ^⋆, +∞), it holds that Q(L_z(δ)) > 0, which verifies that the measure Q is coherent and completes the proof of the first part. The proof of the second part is as follows. Under the assumption that the ERM problem in (4) possesses a solution and the measure Q is coherent, it follows that δ^⋆_{Q,z} = ρ^⋆. Hence, T(z) = N^⋆_{Q,z}, which completes the proof of the second part.
The following theorem highlights that the set N Q,z (λ) is decreasing with λ.
Theorem 9: The sets N_{Q,z}(λ) in (166) decrease, as λ tends to zero, to the set N^⋆_{Q,z} defined in (167). Moreover, if the empirical risk function L_z in (3) is continuous on M and separable with respect to the measure Q in (19), then the decrease is strict. Proof: The proof is presented in Appendix Q.
An interesting observation is that for all λ ∈ K_{Q,z}, with K_{Q,z} in (23), only a subset of N_{Q,z}(λ) might exhibit nonzero probability with respect to the measure P^{(Q,λ)}_{Θ|Z=z} in (25). Consider for instance a measure Q for which ρ^⋆ < δ^⋆_{Q,z}, with δ^⋆_{Q,z} in (37) and ρ^⋆ in (47). Thus, for all γ ∈ (ρ^⋆, δ^⋆_{Q,z}), it holds that Q(L_z(γ)) = 0, with the set L_z(·) in (36). From Lemma 3, this implies that for all γ ∈ (ρ^⋆, δ^⋆_{Q,z}), the measure P^{(Q,λ)}_{Θ|Z=z} in (25) satisfies P^{(Q,λ)}_{Θ|Z=z}(L_z(γ)) = 0, while L_z(γ) ⊆ N_{Q,z}(λ). These observations lead to the analysis of the asymptotic concentration of probability in the following section.
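The concentration described above can be visualized on a hypothetical finite model set: as λ decreases, the mass that the solution in (25) assigns to the low-risk models increases towards one.

```python
import numpy as np

def gibbs_probs(q, losses, lam):
    # Solution (25) on a finite model set: p_i ∝ q_i * exp(-L_z(theta_i)/lam)
    w = q * np.exp(-losses / lam)
    return w / w.sum()

q = np.full(5, 0.2)                              # uniform reference measure
losses = np.array([0.1, 0.1, 0.6, 0.9, 1.3])     # hypothetical empirical risks
delta = 0.2                                      # slightly above the minimum risk
masses = [gibbs_probs(q, losses, lam)[losses <= delta].sum()
          for lam in (1.0, 0.3, 0.05)]
print(masses)   # mass of the low-risk models grows as lam decreases
```

The probability of the low-risk set increases monotonically along the decreasing sequence of λ values, consistent with the limit-one behavior claimed for Theorem 11.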

B. The Nonnegligible Limit Set
The first step in the analysis of the asymptotic concentration of the probability measure P^{(Q,λ)}_{Θ|Z=z} in (25) is to show that the probability P^{(Q,λ)}_{Θ|Z=z}(N_{Q,z}(λ)) increases when λ tends to zero, as shown by the following theorem. Theorem 10: For all (λ_1, λ_2) ∈ K_{Q,z} × K_{Q,z}, with K_{Q,z} in (23) and λ_1 > λ_2, assume that the measures P^{(Q,λ_1)}_{Θ|Z=z} and P^{(Q,λ_2)}_{Θ|Z=z} satisfy (25) with λ = λ_1 and λ = λ_2, respectively. Then, the set N_{Q,z}(λ_2) in (166) satisfies P^{(Q,λ_2)}_{Θ|Z=z}(N_{Q,z}(λ_2)) ⩾ P^{(Q,λ_1)}_{Θ|Z=z}(N_{Q,z}(λ_2)), where strict inequality holds if and only if the function L_z is separable with respect to the σ-finite measure Q.
Proof: The proof is presented in Appendix R.
The following lemma highlights a case in which a stronger concentration of probability is observed.
Lemma 25: Let the function L_z in (3) be separable with respect to the σ-finite measure Q in (19). Let also λ_1 and λ_2, with (λ_1, λ_2) ∈ K_{Q,z} × K_{Q,z} and K_{Q,z} in (23), be two positive reals such that λ_1 > λ_2. Then, the concentration inequality of the lemma holds, with the complement taken with respect to the set of models M.
Proof: The proof is presented in Appendix S.
The following example shows the relevance of Lemma 25 in the case in which the empirical risk function L z in (3) is a simple function and separable with respect to the σ-finite measure Q in (19).
Proof: The proof follows immediately from Lemma 7 and by noticing that for all λ ∈ K_{Q,z}, with K_{Q,z} in (23), the sets L^⋆_{Q,z} in (38) and N_{Q,z}(λ) in (166) satisfy the required relation. Note that Theorem 11 and Lemma 7 lead to the conclusion that the probability P^{(Q,λ)}_{Θ|Z=z}(L^⋆_{Q,z}) tends to one as λ tends to zero, which follows from the definition of L^⋆_{Q,z} in (38). This justifies referring to the set L^⋆_{Q,z} as the nonnegligible limit set.

X. (δ, ϵ)-OPTIMALITY
This section introduces a PAC guarantee of optimality for the models that are sampled from the probability measure P^{(Q,λ)}_{Θ|Z=z} in (25) with respect to the ERM problem in (4). Such a guarantee is defined as follows.
Definition 6 ((δ, ϵ)-Optimality): Given a pair of positive reals (δ, ϵ), with ϵ < 1, the probability measure P^{(Q,λ)}_{Θ|Z=z} in (25) is said to be (δ, ϵ)-optimal if the set L_z(δ) in (36) satisfies P^{(Q,λ)}_{Θ|Z=z}(L_z(δ)) > 1 − ϵ. If the probability measure P^{(Q,λ)}_{Θ|Z=z} in (25) is (δ, ϵ)-optimal, then it assigns a probability greater than 1 − ϵ to a set of models that induce an empirical risk smaller than δ. From this perspective, particular interest is given to the smallest δ and ϵ for which such a guarantee holds. The main result of this section is presented by the following theorem.
Proof: Let δ be a real satisfying δ > δ^⋆_{Q,z}, with δ^⋆_{Q,z} in (37). Let also γ ∈ K_{Q,z} satisfy the equality in (182). Note that from Lemma 15, it follows that the function K^{(1)}_{Q,z} is continuous. Moreover, from Theorem 6, it follows that such a γ in (182) always exists. From (36) and (166), it holds that N_{Q,z}(γ) ⊆ L_z(δ). Let λ be a positive real such that λ ⩽ γ and the inequality in (185) holds. The existence of such a positive real λ follows from Theorem 11. Hence, from (185), the inequality in (187) follows, using the fact that N_{Q,z}(λ) ⊆ N_{Q,z}(γ) ⊆ L_z(δ). Finally, the inequality in (187) implies that the probability measure P^{(Q,λ)}_{Θ|Z=z} is (δ, ϵ)-optimal (Definition 6). This completes the proof.
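Theorem 12's guarantee is constructive in spirit: shrinking λ eventually meets any prescribed (δ, ϵ) pair with δ above the minimal risk. The halving search below is an illustrative sketch on a hypothetical finite model set, not the proof's actual construction.

```python
import numpy as np

def gibbs_mass_below(q, losses, lam, delta):
    # Probability that a model sampled from (25) has empirical risk <= delta
    w = q * np.exp(-losses / lam)
    p = w / w.sum()
    return p[losses <= delta].sum()

q = np.full(4, 0.25)                       # uniform reference measure
losses = np.array([0.05, 0.4, 0.8, 1.0])   # hypothetical empirical risks
delta, eps = 0.1, 0.01                     # target PAC guarantee
lam = 1.0
while gibbs_mass_below(q, losses, lam, delta) < 1 - eps:
    lam /= 2                               # shrink the regularization factor
print(lam, gibbs_mass_below(q, losses, lam, delta))
```

The loop terminates because the mass assigned to the low-risk set tends to one as λ tends to zero, in line with Theorem 11.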
A stronger optimality claim can be stated when the reference measure is coherent.
Proof: The proof is divided into two parts. The first part shows that if, for all (δ, ϵ) ∈ (ρ^⋆, +∞) × (0, 1), there always exists a λ ∈ K_{Q,z}, with K_{Q,z} in (23), such that the probability measure P^{(Q,λ)}_{Θ|Z=z} in (25) is (δ, ϵ)-optimal, then the measure Q is coherent. The second part deals with the converse.
The first part is as follows. Let γ ∈ K_{Q,z} be such that the probability measure P^{(Q,γ)}_{Θ|Z=z} is (δ, ϵ)-optimal. Then, considering all measurable subsets A of L_z(δ), Lemma 2 implies that there exists at least one measurable subset A for which Q(A) > 0, and thus Q(L_z(δ)) > 0, which implies that the measure Q is coherent. This completes the first part of the proof.
The second part of the proof is as follows. Under the assumption that the measure Q is coherent, it follows that δ^⋆_{Q,z} = ρ^⋆. Then, from Theorem 12, it follows that for all (δ, ϵ) ∈ (δ^⋆_{Q,z}, +∞) × (0, 1), there always exists a λ ∈ K_{Q,z}, with K_{Q,z} in (23), such that the probability measure P^{(Q,λ)}_{Θ|Z=z} is (δ, ϵ)-optimal. This completes the second part of the proof.

XI. SENSITIVITY AND GENERALIZATION
This section introduces the notion of sensitivity and establishes its connection with the notion of generalization error of the Gibbs algorithm, cf. [9].

A. Sensitivity
The sensitivity of the expected empirical risk R_z in (18) to deviations from the probability measure P^{(Q,λ)}_{Θ|Z=z} in (25) towards an alternative probability measure P ∈ △(M, B(M)) is introduced as a novel metric to evaluate the generalization capabilities of the ERM-RER-optimal measure P^{(Q,λ)}_{Θ|Z=z}. Deviations from the probability measure P^{(Q,λ)}_{Θ|Z=z} towards an alternative probability measure P allow comparing the ERM-RER-optimal measure with alternative measures (or algorithms). For instance, if new datasets become available, a new ERM-RER problem can be formulated using a larger dataset obtained by aggregating the old and the new datasets, cf. [11] and [72]. Intuitively, the ERM-RER-optimal measure obtained after the aggregation of datasets might exhibit better generalization capabilities; see for instance [11]. This analysis motivates the notion of sensitivity, which is defined as follows.
Definition 7 (Sensitivity): Given the σ-finite measure Q and the positive real λ > 0 in (19), the sensitivity of the expected empirical risk R_z due to a deviation from P^{(Q,λ)}_{Θ|Z=z} towards a probability measure P is denoted by S_{Q,λ}(z, P) and defined in (191), where the functional R_z is defined in (18) and the probability measure P^{(Q,λ)}_{Θ|Z=z} is in (25). Recently, the following exact expression for the sensitivity S_{Q,λ}(z, P) in (191) was introduced in [11].
Theorem 14 (Theorem 1 in [11]): The sensitivity S_{Q,λ}(z, P) in (191) satisfies (192), where the probability measure P^{(Q,λ)}_{Θ|Z=z} is in (25). The following theorem introduces an upper bound on the absolute value of the sensitivity S_{Q,λ}(z, P) in (191), which requires the calculation of only one of the relative entropies appearing in Theorem 14.
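The exact expression from [11, Theorem 1] is not legible in this excerpt. On a finite model set, a change-of-measure computation starting from (25) suggests the reconstruction S_{Q,λ}(z, P) = λ(D(P‖P^{(Q,λ)}_{Θ|Z=z}) − D(P‖Q) + D(P^{(Q,λ)}_{Θ|Z=z}‖Q)), which indeed involves the relative entropies mentioned above; this identity can be verified numerically (all values hypothetical).

```python
import numpy as np

def kl(p, q):
    # Relative entropy D(p || q) between discrete distributions
    return np.sum(p * np.log(p / q))

rng = np.random.default_rng(1)
q = rng.dirichlet(np.ones(5))            # reference probability measure Q
losses = rng.uniform(0.0, 1.0, 5)        # hypothetical empirical risks L_z
lam = 0.4
pg = q * np.exp(-losses / lam)
pg /= pg.sum()                           # solution (25) on the finite model set
p_alt = rng.dirichlet(np.ones(5))        # alternative measure P

sens = np.sum(p_alt * losses) - np.sum(pg * losses)      # S_{Q,lam}(z, P)
rhs = lam * (kl(p_alt, pg) - kl(p_alt, q) + kl(pg, q))   # conjectured identity
print(sens - rhs)
```

The difference vanishes to machine precision; the derivation only uses the fact that L_z(θ) = −λ(log dP^{(Q,λ)}_{Θ|Z=z}/dQ(θ) + K_{Q,z}(−1/λ)) on the support of Q.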

Theorem 15:
where the constant β Q,z is defined in (163).
Proof: The proof is presented in Appendix T.
Note that equality holds in (193) in the trivial case in which the empirical risk function is not separable with respect to Q (Definition 5). In such a case, for all P ∈ △_Q(M, B(M)), it holds that S_{Q,λ}(z, P) = 0 and β_{Q,z} = 0.
Theorem 15 establishes an upper and a lower bound on the increase and decrease, respectively, of the expected empirical risk that can be obtained by deviating from the optimal solution of the ERM-RER problem in (19). More specifically, the bound in (193) holds uniformly over all probability measures P ∈ △_Q(M, B(M)).

B. Generalization Error
This section unveils an interesting connection between the notion of sensitivity and the notion of generalization error of the Gibbs algorithm, cf. [9]. The generalization error is defined under the assumption that datasets are sampled from a probability measure P_Z in (196), where F denotes a given σ-field on the set (X × Y)^n. For such a probability measure P_Z in (196), let the set K_{Q,P_Z} be defined as in (197), where the σ-finite measure Q is in (19). The set K_{Q,P_Z} in (197) can be empty for some choices of the σ-finite measure Q. Nonetheless, from Lemma 1, it follows that if Q is a probability measure, then K_{Q,P_Z} = (0, +∞). Under the assumption that datasets are sampled from P_Z in (196), the generalization error of the Gibbs algorithm with parameters Q and λ is defined as the expectation, with respect to the product of P_Z and the measure P^{(Q,λ)}_{Θ|Z} in (25), of the difference between: (a) the population risk induced by a model θ ∈ M, with the function L_z defined in (3); and (b) the empirical risk induced by the model θ with respect to the training dataset z, that is, L_z(θ). More specifically, the generalization error of the Gibbs algorithm with parameters Q and λ is G_{Q,λ}(P_Z) in (202), where the probability measure P^{(Q,λ)}_Θ satisfies (203) for all sets A ∈ B(M), and the functional R_ν is defined in (18).
The following theorem establishes a connection between sensitivity and generalization error in the particular case in which Q in ( 19) is a probability measure.
Theorem 16: Under the assumption that datasets are sampled from P_Z in (196), the generalization error of the Gibbs algorithm with parameters Q (a probability measure) and λ > 0 is given by (202), where the functional S_{Q,λ} is in (191), and the probability measure P^{(Q,λ)}_Θ is in (203). Proof: The proof uses the fact that, under the assumption that Q is a probability measure, for all ν ∈ supp P_Z, it follows from Lemma 1 that K_{Q,ν} = (0, +∞). This implies that for all z ∈ supp P_Z and for all λ > 0, the ERM-RER problem in (19) always possesses a solution, namely the measure P^{(Q,λ)}_{Θ|Z=z} in (25). Thus, the measure P^{(Q,λ)}_Θ in (203) is well defined, and the integral in (202) is also well defined, which completes the proof.
Theorem 16 provides an interesting viewpoint on the generalization error. For instance, the probability measure P^{(Q,λ)}_Θ in (203) can be understood as the barycenter of a subset of △(M, B(M)) containing the solutions to ERM-RER problems of the form in (19), with z ∈ supp P_Z and P_Z in (196). Hence, the generalization error of the Gibbs algorithm is the expectation (with respect to P_Z) of the sensitivity of the expected empirical risk R_z in (18) to deviations from the ERM-RER-optimal measure P^{(Q,λ)}_{Θ|Z=z} towards the barycenter, i.e., the measure P^{(Q,λ)}_Θ. The following definition extends the notion of generalization error to Gibbs algorithms obtained by assuming that the reference measure Q in (19) is a σ-finite measure. This definition also exploits the relation between the notions of sensitivity and generalization error introduced by Theorem 16.

Definition 8 (Generalization Error of the Gibbs Algorithm):
Let the pair (Q, λ), with λ ∈ K_{Q,P_Z}, be such that the expectation in (205) is well defined, where the functional S_{Q,λ} is in (191); the set K_{Q,P_Z} is in (197); and the probability measure P^{(Q,λ)}_Θ is in (203). The generalization error induced by the Gibbs algorithm with parameters Q and λ, under the assumption that datasets are sampled from the probability measure P_Z, is G_{Q,λ}(P_Z) in (205).
The main difficulty in extending the notion of generalization error to Gibbs algorithms obtained under the assumption that the reference measure is not a probability measure, but a σ-finite measure, is that the integrals in (202) and (203) might not be well defined. This is essentially due to the fact that, while the ERM-RER problem in (19) always possesses a solution when Q is a probability measure, the existence of a solution when Q is not a probability measure is subject to the condition that, for all z ∈ supp P_Z, λ ∈ K_{Q,z}, with K_{Q,z} in (23). This leads to the condition that λ ∈ K_{Q,P_Z}, with the set K_{Q,P_Z} in (197). When such a condition is not met, the definition of sensitivity is void.
The following theorem provides a closed-form expression for the generalization error of the Gibbs algorithm in the general case in which the reference measure Q in (19) is a σ-finite measure.
Theorem 17: The generalization error G_{Q,λ}(P_Z) in (205) satisfies the equality in (206), where for all z ∈ supp P_Z, the probability measure P^{(Q,λ)}_{Θ|Z=z} is in (25), and the probability measure P^{(Q,λ)}_Θ is in (203). Proof: The proof is presented in Appendix U.
The terms on the right-hand side of (206) are respectively the mutual and the lautum information [56] induced by a joint probability measure P_{Θ,Z} whose marginals are P_Z in (196) and P^{(Q,λ)}_Θ in (203). When Q in (19) is a probability measure, Theorem 17 reduces to [9, Theorem 1]. Interestingly, independently of whether the reference measure Q in (19) is a probability measure, or whether the n data points in the datasets are independent and identically distributed, the generalization error G_{Q,λ}(P_Z) in (205) is always a factor of the sum of the mutual and lautum information induced by the joint probability measure P_{Θ,Z} mentioned above.
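For a finite toy problem (all values hypothetical), the factor relating G_{Q,λ}(P_Z) to the sum of mutual and lautum information can be probed numerically; under the discrete construction below the factor works out to λ, which should be read as a sketch consistent with Theorem 16 rather than as the exact constant in (206).

```python
import numpy as np

def kl(p, q):
    # Relative entropy D(p || q) between discrete distributions
    return np.sum(p * np.log(p / q))

rng = np.random.default_rng(2)
n_z, n_m = 3, 4
pz = rng.dirichlet(np.ones(n_z))          # P_Z over a finite set of datasets
q = rng.dirichlet(np.ones(n_m))           # reference probability measure Q
L = rng.uniform(0.0, 1.0, (n_z, n_m))     # L_z(theta) for each dataset z
lam = 0.5

cond = q * np.exp(-L / lam)
cond /= cond.sum(axis=1, keepdims=True)   # rows: P_{Theta|Z=z}, as in (25)
marg = pz @ cond                          # barycenter P_Theta

# Generalization error: expected population risk minus expected empirical risk
pop = np.sum(pz[:, None] * L, axis=0)     # population risk of each model
gen = np.sum(pz * (cond @ pop)) - np.sum(pz * np.sum(cond * L, axis=1))

mi = np.sum(pz * np.array([kl(cond[i], marg) for i in range(n_z)]))
lautum = np.sum(pz * np.array([kl(marg, cond[i]) for i in range(n_z)]))
print(gen - lam * (mi + lautum))
```

The difference vanishes to machine precision, illustrating that the generalization error is nonnegative and accumulates exactly the symmetrized divergence between each per-dataset solution and the barycenter.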
Theorem 17 also provides an alternative interpretation of the generalization error G_{Q,λ}(P_Z) in (205). Note that, by rewriting the right-hand side of (206), it becomes clear that G_{Q,λ}(P_Z) is the expectation with respect to P_Z of the symmetrized Kullback-Leibler divergence, also known as Jeffrey's divergence [69], between the probability measures P^{(Q,λ)}_{Θ|Z=z} and P^{(Q,λ)}_Θ, that is, between the solution to the ERM-RER problem in (19) and the barycenter induced by P_Z.
The following theorem provides an upper bound on the generalization error of the Gibbs algorithm in terms of only the lautum information induced by such a joint probability measure P_{Θ,Z}.
Theorem 18: The generalization error G_{Q,λ}(P_Z) in (205) satisfies the inequalities in (208), where for all z ∈ supp P_Z, the probability measure P^{(Q,λ)}_{Θ|Z=z} is in (25); the probability measure P^{(Q,λ)}_Θ is defined in (203); and β_{Q,z} is in (163).
Proof: The proof of the inequality G_{Q,λ}(P_Z) ⩾ 0 follows from observing that, for all ν ∈ (X × Y)^n, the relevant terms are nonnegative (Theorem 1). The proof of the remaining inequality follows from (205) and the chain of inequalities in (209)–(213), where the equality in (209) follows from (205); the inequality in (210) follows from [59, Theorem 1.5.9(c)]; the inequality in (211) follows from Theorem 15; the inequality in (212) follows from (208); and the inequality in (213) follows from Jensen's inequality [59, Section 6.3.5]. This completes the proof.
In a nutshell, the generalization error G_{Q,λ}(P_Z) in (205) is upper bounded, up to a constant factor, by the square root of the lautum information induced by the joint probability measure P_{Θ,Z} mentioned above. Theorem 18 is reminiscent of [33, Theorem 1], which provides a similar upper bound on G_{Q,λ}(P_Z) using the mutual information instead of the lautum information induced by the joint probability measure P_{Θ,Z}. The interest of Theorem 18 for the specific case of the Gibbs algorithm lies in the fact that it holds under milder conditions than those in [33, Theorem 1]. For instance, no sub-Gaussianity conditions on the loss function ℓ in (2) are assumed. Moreover, the probability measure P_Z from which datasets are sampled is not necessarily a product measure.
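Since the bound contrasts the lautum information with the mutual information used in [33, Theorem 1], a small numerical sketch may help: for a discrete joint probability measure, both quantities are finite sums. The function names and the example joint pmf below are illustrative assumptions, not from the paper.

```python
import numpy as np

def mutual_information(pxy):
    # I(X;Y) = D(P_{X,Y} || P_X ⊗ P_Y), in nats.
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    return float(np.sum(pxy * np.log(pxy / (px * py))))

def lautum_information(pxy):
    # L(X;Y) = D(P_X ⊗ P_Y || P_{X,Y}) [56]: the KL divergence taken
    # in the direction opposite to that of mutual information.
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    pxpy = px * py
    return float(np.sum(pxpy * np.log(pxpy / pxy)))

# Illustrative joint pmf of (Θ, Z) on a 2 × 2 alphabet.
pxy = np.array([[0.30, 0.10],
                [0.15, 0.45]])
```

Both quantities vanish exactly when the joint measure is a product measure, and are strictly positive otherwise.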

XII. CONCLUSIONS AND FINAL REMARKS
The classical ERM-RER problem in (19) has been studied under the assumption that the reference measure Q is a σ-finite measure, instead of a probability measure, which leads to a more general problem that includes the ERM problem with (discrete or differential) entropy regularization and the information-risk minimization problem. While the solution to the ERM-RER problem always exists when the reference measure is a probability measure, in this more general case the existence of a solution is subject to a condition that depends on the loss function, the reference measure, the regularization factor, and the training dataset. When a solution exists, it has been proved to be unique. Additionally, if it exists, such a solution and the reference measure are mutually absolutely continuous. Interestingly, the empirical risk observed when models are sampled from the ERM-RER-optimal probability measure is a sub-Gaussian random variable that exhibits a PAC guarantee for the ERM problem. That is, for some positive δ and ϵ, it is shown that there always exist parameters for the ERM-RER problem such that the set of models inducing an empirical risk smaller than δ exhibits a probability not smaller than 1 − ϵ. Interestingly, none of these results relies on statistical assumptions on the datasets.
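As a numerical illustration of the ERM-RER solution summarized above, the sketch below computes the Gibbs measure dP/dQ ∝ exp(−L_z(θ)/λ) on a finite model set, taking the reference Q to be the counting measure, i.e., a σ-finite measure that is not a probability measure. All names and values are illustrative assumptions.

```python
import numpy as np

def gibbs_measure(emp_risk, q_weights, lam):
    # ERM-RER solution on a finite model set: dP/dQ ∝ exp(-L_z(θ)/λ).
    # q_weights need not sum to one: Q may be any σ-finite reference
    # measure, e.g. the counting measure (all weights equal to one).
    w = q_weights * np.exp(-emp_risk / lam)
    z = w.sum()  # normalization; finite iff λ lies in the set K_{Q,z}
    return w / z

L = np.array([0.0, 0.1, 0.5, 2.0])   # empirical risks L_z(θ)
Q = np.ones_like(L)                  # counting measure: not a probability measure
p = gibbs_measure(L, Q, lam=0.25)
```

Consistent with the results summarized above, the output is a probability measure that is mutually absolutely continuous with Q (every model keeps strictly positive mass), and a smaller λ concentrates the mass on low-risk models.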
The sensitivity of the expected empirical risk to deviations from the ERM-RER-optimal measure towards alternative measures is introduced as a new performance metric for evaluating the generalization capabilities of the Gibbs algorithm. In particular, an upper bound on the absolute value of the sensitivity, which depends on the training dataset, is presented. This bound is the product of a constant factor and the square root of the relative entropy of the alternative measure (the deviation) with respect to the ERM-RER solution. Finally, it is shown that the expectation (with respect to the datasets) of the sensitivity to deviations towards a particular measure is equivalent to the generalization error of the Gibbs algorithm. Equipped with this observation, the generalization error is shown, in the most general case, to be, up to a constant factor, the sum of the mutual and the lautum information between the models and the datasets, a result previously known exclusively for the case in which the reference is a probability measure, cf. [9]. From this perspective, it is argued that studying the generalization capabilities of the Gibbs algorithm through the generalization error alone offers a rather narrow view: it considers only an expectation of the sensitivity to deviations towards one particular measure, namely, the barycenter of the set of ERM-RER solutions induced by a prior on the datasets. A broader view is offered by the study of the sensitivity to deviations towards other measures, e.g., ERM-RER-optimal measures obtained with different training datasets. This approach has already led to a few initial results in [11] that highlight the connections among sensitivity, training error, and test error. Nonetheless, the study of the sensitivity with the aim of describing the generalization capabilities of learning algorithms remains an open problem.

APPENDIX A PROOF OF THEOREM 2
Consider the function
and note that it is strictly convex. From the assumption that, for all i ∈ {1, 2}, P_i and Q_i are both measures on the same measurable space (Ω, F), with P_i absolutely continuous with respect to Q_i, let g : Ω → [0, ∞) be the function where Using this notation, for all λ ∈ (0, 1), where the function f is defined in (214). Let β_1 and β_2 be the following constants: From (217) and (218), it follows that for all λ ∈ (0, 1), where the inequalities in (219) and (222) follow from Jensen's inequality [59, Section 6.3.5] and the fact that the function f in (214) is strictly convex. Note that from (218), in (219), for all i ∈ {1, 2}, ∫ β_i g(x) dQ_i(x) = 1; while in (222), Given the strict convexity of the function f in (214), equality in (219) and (222) holds if and only if This completes the proof.
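The convexity argument in the proof can be checked numerically: for f(t) = t log t and any non-constant positive sample g, Jensen's inequality f(mean(g)) ⩽ mean(f(g)) holds, strictly unless the sample is constant. A small sketch; the sampled g is an arbitrary illustrative choice.

```python
import numpy as np

def f(t):
    # f(t) = t log t: the integrand of relative entropy, strictly convex on (0, ∞).
    return t * np.log(t)

# Jensen's inequality for the empirical measure of a non-constant sample:
# f(mean(g)) < mean(f(g)); equality holds only for a constant sample.
rng = np.random.default_rng(0)
g = rng.uniform(0.5, 2.0, size=10_000)  # arbitrary positive, non-constant sample
lhs = f(g.mean())
rhs = f(g).mean()
```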

APPENDIX B PROOF OF LEMMA 1
The proof is divided into two parts. The first part is as follows. Under the assumption that the set K_{Q,z} in (23) is empty, there is nothing to prove. Alternatively, under the assumption that the set K_{Q,z} is not empty, there always exists a real b⋆, with L_z in (3). Thus, from (22), it follows that Hence, if b⋆ = +∞, it follows from (23) that Alternatively, if b⋆ < +∞, it holds that In either case, it follows that K_{Q,z} is a convex set. This completes the first part of the proof.
The second part of the proof is carried out under the assumption that Q is a probability measure. Under this assumption, for all θ ∈ M and for all t > 0, it follows that exp(−t L_z(θ)) ⩽ 1, with L_z in (3). Thus, ∫ exp(−t L_z(θ)) dQ(θ) ⩽ Q(M) = 1 < +∞, which implies that (0, +∞) ⊆ K_{Q,z}. Thus, if Q is a probability measure, it holds from (23) that K_{Q,z} = (0, +∞), which completes the proof.

APPENDIX C PROOF OF THEOREM 3
The optimization problem in (19) can be rewritten in terms of the Radon-Nikodym derivative of the optimization measure P with respect to the measure Q, denoted by dP/dQ : M → [0, ∞), which yields: The remainder of the proof focuses on the problem in which the optimization is over the function dP/dQ instead of the measure P. This is due to the fact that, for all P ∈ △_Q(M), the Radon-Nikodym derivative dP/dQ is unique up to sets of zero measure with respect to the measure Q. Let M be the set of measurable functions M → R with respect to the measurable spaces (M, B(M)) and (R, B(R)) that are absolutely integrable with respect to Q. That is, for all ĝ ∈ M, it holds that ∫ |ĝ(θ)| dQ(θ) < ∞. (238) Hence, the optimization problem of interest is: Let the Lagrangian of the optimization problem in (239) be the functional where β is a real that acts as a Lagrange multiplier due to the constraint (239b). Let ĝ : M → R be a function in M. The Gateaux differential of the functional L in (240) at (g, β), in the direction of ĝ, if it exists, is given by (241). The proof continues under the assumption that the functions g and ĝ are such that the Gateaux differential in (241) exists.
Under such an assumption, let the function r : R → R satisfy, for all α ∈ (−ϵ, ϵ), with ϵ arbitrarily small, where the last equality is simply an algebraic rearrangement of terms. From the assumption that the functions g and ĝ are such that the Gateaux differential in (241) exists, it follows that the function r in (243) is differentiable at zero. Note that the first two terms in (243) are independent of α; the third term is linear in α; and the fourth term can be written using the function r̄ : R → R that, for all α ∈ (−ϵ, ϵ), with ϵ arbitrarily small, satisfies where f : (0, +∞) → R is such that f(t) = t log(t). Under the same assumption, it follows that the function r̄ in (244) is differentiable at zero. That is, the limit exists for all γ ∈ (−ϵ, ϵ), with ϵ arbitrarily small. Note that the function f in (245) is continuous and differentiable (with finite derivative) on (0, +∞). Thus, the function f is also Lipschitz continuous on every closed subinterval of (0, +∞). This implies that for all θ ∈ supp Q, and for all γ ∈ (−ϵ, ϵ), with ϵ > 0 arbitrarily small, it holds that with δ > 0, for some constant c positive and finite. This implies that Using these arguments, the limit in (246) satisfies, for all γ ∈ (−ϵ, ϵ), with ϵ > 0 arbitrarily small, where the function ḟ : (0, +∞) → R is the derivative of f, that is, ḟ(t) = 1 + log(t). The equality in (249) and the inequality in (250) follow from noticing that the conditions for the dominated convergence theorem hold [59, Theorem 1.6.9], namely: • for all γ ∈ (−ϵ, ϵ), with ϵ > 0, the inequality in (248) holds; • the function ĝ in (248) satisfies the inequality in (238); and • for all θ ∈ supp Q and for all γ ∈ (−ϵ, ϵ), with ϵ > 0 arbitrarily small, it holds that Hence, the derivative of the real function r in (243) is From (241) and (253), it follows that The relevance of the Gateaux differential in (254) stems from [73, Theorem 1, page 178], which unveils the fact that a necessary condition for the functional L in (240) to have a minimum at (g, β) is that the Gateaux differential in (254) be equal to zero for all directions ĝ. From
(255), it follows that which implies that for all ν ∈ supp Q, and thus, dP/dQ satisfies with β chosen to satisfy (239b). That is, The proof continues by verifying that the measure P^{(Q,λ)}_{Θ|Z=z} that satisfies (258) is the unique solution to the ERM-RER problem in (19). Such a verification is done by showing that the objective function in (19) is strictly convex in the optimization variable. Let P_1 and P_2 be two different probability measures in (M, B(M)) and let α be in (0, 1). Hence, where the functional R_z is defined in (18). The equality above follows from the properties of the Lebesgue integral, while the inequality follows from Theorem 2. This proves that the solution is unique, due to the strict convexity of the objective function, which completes the proof.
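The optimality established in this appendix can be probed numerically on a finite model set: the ERM-RER objective evaluated at the Gibbs measure is never larger than its value at randomly drawn competing probability measures. The sketch below is illustrative; the model set, risks, and reference measure are assumptions.

```python
import numpy as np

def objective(p, L, q, lam):
    # ERM-RER objective: expected empirical risk plus λ D(p || q).
    return float(p @ L + lam * np.sum(p * np.log(p / q)))

L = np.array([0.3, 1.0, 0.2, 0.7])
q = np.array([0.25, 0.25, 0.25, 0.25])  # reference measure (here uniform)
lam = 0.5

# Gibbs candidate: dP/dQ ∝ exp(-L/λ), normalized.
p_gibbs = q * np.exp(-L / lam)
p_gibbs /= p_gibbs.sum()

# Compare against randomly drawn competing probability measures.
rng = np.random.default_rng(1)
best = objective(p_gibbs, L, q, lam)
others = [objective(rng.dirichlet(np.ones(4)), L, q, lam) for _ in range(200)]
```

By the strict convexity argument above, any measure different from the Gibbs candidate yields a strictly larger objective value.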

APPENDIX D PROOF OF LEMMA 2
From Theorem 3, it follows that, for all θ ∈ supp Q, where the inequality in (261) follows from the fact that the function L_z is nonnegative; and the equality in (262) follows from the fact that λ ∈ K_{Q,z}. This completes the proof of finiteness.
The proof of positivity follows from observing that λ ∈ K_{Q,z} and thus, K_{Q,z}(−1/λ) < +∞. Moreover, for all θ ∈ supp Q, it holds that exp(−(1/λ) L_z(θ)) ⩾ 0, with equality if and only if L_z(θ) = +∞. These two observations put together yield with equality if and only if L_z(θ) = +∞. The proof continues by showing that Q({θ ∈ M : L_z(θ) = +∞}) = 0. From the assumption that λ ∈ K_{Q,z}, it follows that where the inequality in (266) holds from the fact that, for all x ⩾ 0, exp(x) ⩾ 1 + x. Hence, the inequality in (267) implies that Q(M) < +∞ and (268) Finally, from [59, Lemma 1.6.6] and the inequality in (269), it follows that Q({θ ∈ M : L_z(θ) = +∞}) = 0. Hence, L_z(θ) < +∞ almost surely with respect to the measure Q, which completes the proof.

APPENDIX E PROOF OF LEMMA 3
The probability measure P^{(Q,λ)}_{Θ|Z=z} in (25) satisfies, for all C ∈ B(M), and thus, if Q(C) = 0, then which implies the absolute continuity of P^{(Q,λ)}_{Θ|Z=z} with respect to Q.
Alternatively, given a set C ∈ B(M), assume now that P^{(Q,λ)}_{Θ|Z=z}(C) = 0. Hence, it follows that if and only if Q(C) = 0. This verifies the absolute continuity of Q with respect to P^{(Q,λ)}_{Θ|Z=z}, and completes the proof.
APPENDIX F PROOF OF LEMMA 4
Consider a measure P on (M, B(M)) such that, for all sets A ∈ B(M), and note that if P^{(Q,β)}_{Θ|Z=z}(A) = 0, then P(A) = 0. This implies that P is absolutely continuous with respect to P^{(Q,β)}_{Θ|Z=z}. Moreover, from (277), it follows that which implies that the probability measures P in (277) and P^{(Q,α)}_{Θ|Z=z} are identical. Thus, P^{(Q,α)}_{Θ|Z=z} is absolutely continuous with respect to P^{(Q,β)}_{Θ|Z=z}. The proof that P^{(Q,β)}_{Θ|Z=z} is absolutely continuous with respect to P^{(Q,α)}_{Θ|Z=z} follows the same argument. This completes the proof.

APPENDIX G PROOF OF LEMMA 6
From Theorem 3, the probability measure P^{(Q,λ)}_{Θ|Z=z} in (25) satisfies, for all θ ∈ supp Q, Given θ ∈ supp Q, consider the partition of supp Q formed by the sets A_0(θ), A_1(θ), and A_2(θ), which satisfy the following: Using the sets A_0(θ), A_1(θ), and A_2(θ) in (285), the following holds for all θ ∈ supp Q, Note that the sets with δ⋆_{Q,z} in (37) form a partition of the set supp Q. Following this observation, the rest of the proof is divided into three parts. The first part evaluates the limit as λ → 0 in the case in which θ ∈ {ν ∈ supp Q : L_z(ν) = δ⋆_{Q,z}}. The second part considers the case in which θ ∈ {ν ∈ supp Q : L_z(ν) = δ} for some δ > δ⋆_{Q,z}. The third part considers the remaining case.
The first part is as follows. Consider a θ ∈ {ν ∈ supp Q : L_z(ν) = δ⋆_{Q,z}}. Hence, the sets A_0(θ), A_1(θ), and A_2(θ) in (286) satisfy the following: From the definition of δ⋆_{Q,z} in (37), it follows that Q(A_2(θ)) = 0. Plugging the equalities in (292) into (288) yields, for all such θ, The equality in (293) implies that, for all such θ, where the equality in (295) follows from verifying that the dominated convergence theorem [59, Theorem 2.6.9] holds. That is, (a) for all ν ∈ A_1(θ), it holds that exp(−(1/λ)(L_z(ν) − L_z(θ))) ⩽ 1. This completes the first part of the proof.
The second part is as follows. For all δ > δ⋆_{Q,z} and for all θ ∈ {ν ∈ supp Q : L_z(ν) = δ}, the sets A_0(θ), A_1(θ), and A_2(θ) in (286) satisfy the following: Consider the sets and note that A_{2,1}(θ) and A_{2,2}(θ) form a partition of A_2(θ). Moreover, from the definition of δ⋆_{Q,z} in (37), it holds that Hence, plugging the equalities in (297) and (300) into (288) yields, for all δ > δ⋆_{Q,z} and for all θ ∈ {ν ∈ supp Q : L_z(ν) = δ}, The equality in (301) implies that, for all δ > δ⋆_{Q,z} and for all θ ∈ {ν ∈ M : L_z(ν) = δ}, where the equality in (303) follows by verifying that the dominated convergence theorem [59, Theorem 2.6.9] holds. That is, (a) for all ν ∈ A_1(θ), it holds that exp(−(1/λ)(L_z(ν) − L_z(θ))) ⩽ 1. This completes the second part.
The third part of the proof follows by noticing that the remaining set is negligible with respect to Q. This completes the third part and, with it, the proof.
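The limiting behavior established in this appendix can be illustrated numerically: as λ → 0⁺, the Gibbs measure concentrates on the set {ν : L_z(ν) = δ⋆_{Q,z}}, with relative weights proportional to the reference measure on that set. A finite-support sketch under illustrative values:

```python
import numpy as np

def gibbs(L, q, lam):
    # Shift by the minimum for numerical stability; the normalized
    # measure is unchanged by the shift.
    w = q * np.exp(-(L - L.min()) / lam)
    return w / w.sum()

L = np.array([0.0, 0.0, 1.0, 3.0])   # two minimizers: δ* = 0
q = np.array([0.1, 0.3, 0.4, 0.2])   # reference weights
p_small = gibbs(L, q, lam=1e-3)
```

For small λ, essentially all the mass sits on the two minimizers, in the ratio 0.3/0.1 = 3 dictated by the reference measure.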

APPENDIX H PROOF OF LEMMA 7
Consider the following partition of the set M formed by the sets Let h : M → R be a function in M. The Gateaux differential of the functional L in (324) at (g, α, β) ∈ M × [0, +∞)², in the direction of h, if it exists, is where the real function r : R → R is such that, for all γ ∈ (−ϵ, ϵ), with some ϵ > 0, The proof continues under the assumption that the functions g and h are such that the Gateaux differential in (325) exists. That is, the function r in (326) is differentiable in (−ϵ, ϵ), with some ϵ > 0. Using the same arguments as in the proof of Theorem 3, it follows that the derivative of the real function r in (326) is From (325) and (327), it follows that From [73, Theorem 1, page 217], it holds that a necessary condition for the functional L in (324) to have a minimum is which implies that, for all ν ∈ M, Thus, where α and β are chosen to satisfy their corresponding constraints with equality. Denote by P⋆ the solution of the optimization problem in (82). Hence, from (331), it follows that where α is chosen to satisfy From Lemma 3, it follows that the probability measure P⋆ and the σ-finite measure Q satisfy which implies that P⋆ is a Gibbs probability measure on (M, B(M)), with energy function L_z, reference measure Q, and regularization parameter (1/λ + 1/α)⁻¹, where α is chosen to satisfy (333). Let the positive real ω be ω ≜ αλ/(α + λ) and note that ω ∈ (0, λ]. The proof ends by verifying that the objective function in (324) is strictly convex, and thus the measure P^{(Q,ω)}_{Θ|Z=z} is the unique minimizer. This completes the proof.

APPENDIX K PROOF OF LEMMA 15
Note that, for all (λ_1, λ_2) ∈ K_{Q,z} × K_{Q,z} with λ_1 > λ_2, it holds that K_{Q,z}(−1/λ_2) ⩽ K_{Q,z}(−1/λ_1) < +∞, which proves that the function K_{Q,z} is nondecreasing.
The proof of continuity of the function K_{Q,z} follows from observing that, for all α ∈ {x ∈ R : K_{Q,z}(x) < +∞}, where (339) and (341) follow from the fact that both the logarithmic and exponential functions are continuous; and the equality in (340) follows from the monotone convergence theorem [59, Theorem 1.6.2]. This shows that the function K_{Q,z} is continuous. The proof of differentiability follows by considering the transport of the σ-finite measure Q in (22) from the measure space (M, B(M)) to the measure space ([0, +∞), B([0, +∞))) through the function L_z in (3). Denote the resulting measure in ([0, +∞), B([0, +∞))) by P. More specifically, for all A ∈ B([0, +∞)), it holds that where the equality in (344) follows from [59, Theorem 1.6.12].
Denote by ϕ the Laplace transform of the measure P. That is, for all t ∈ {x ∈ R : K_{Q,z}(x) < +∞}, Hence, ϕ(t) = exp(K_{Q,z}(t)). From [74, Theorem 1a (page 439)], it follows that the function ϕ has derivatives of all orders in {x ∈ R : K_{Q,z}(x) < +∞}, and thus, so does the function K_{Q,z}. This completes the proof.
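The smoothness shown above can be checked numerically on a finite support, where K_{Q,z}(t) = log Σ_θ q(θ) exp(t L_z(θ)): the first derivative of this log-partition function equals the mean of L_z under the exponentially tilted (Gibbs) measure, consistent with Lemma 17. The sketch below uses illustrative values and a central finite difference.

```python
import numpy as np

def K(t, L, q):
    # Finite-support analogue of K_{Q,z}(t) = log ∫ exp(t L_z(θ)) dQ(θ).
    return float(np.log(np.sum(q * np.exp(t * L))))

def tilted_mean(t, L, q):
    # Mean of L_z under the exponentially tilted (Gibbs) measure.
    w = q * np.exp(t * L)
    w = w / w.sum()
    return float(w @ L)

L = np.array([0.2, 0.9, 1.5])
q = np.array([0.5, 0.3, 0.2])
t = -2.0                      # corresponds to t = -1/λ with λ = 0.5
h = 1e-6
numeric_deriv = (K(t + h, L, q) - K(t - h, L, q)) / (2 * h)
```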

APPENDIX M PROOF OF LEMMA 17
For all s ∈ K_{Q,z}, with K_{Q,z} in (23), the equality in (95) implies the following, where the equality in (356) follows from the dominated convergence theorem [59, Theorem 1.6.9]; the equality in (358) follows from (22); and the equality in (360) follows from (25).
For all s ∈ K_{Q,z}, with K_{Q,z} in (23), the equalities in (95) and (359) imply the following chain of equalities for K^{(2)}_{Q,z}(s), where the equality in (362) follows from the dominated convergence theorem [59, Theorem 1.6.9]; the equality in (364) is due to a change of measure through the Radon-Nikodym derivative in (25); and the equality in (366) follows from (360).
This completes the proof.

APPENDIX N PROOF OF THEOREM 5
The proof is based on the analysis of the derivative of K^{(1)}_{Q,z}(−1/λ) with respect to λ in int K_{Q,z}. This is due to Corollary 2. For instance, note that where the equality in (380) follows from Lemma 17. The inequality in (381) implies that the expected empirical risk R_z(P^{(Q,λ)}_{Θ|Z=z}) in (104) is nondecreasing with respect to λ. The rest of the proof consists in showing that, for all α ∈ K_{Q,z}, the function K^{(2)}_{Q,z} in (95) satisfies K^{(2)}_{Q,z}(−1/α) > 0 if and only if the function L_z in (3) is separable. To do so, a handful of preliminary results are described in the following subsection. The proof of Theorem 5 resumes in Subsection N-B.
Lemma 27: For all α ∈ K_{Q,z}, with K_{Q,z} in (23), the measure P^{(Q,α)}_{Θ|Z=z} in (25) satisfies if and only if where the sets R_1(α) and R_2(α) are in (382b) and (382c), respectively.

B. The proof
The rest of the proof of Theorem 5 is divided into two parts. In the first part, it is shown that if, for all α ∈ K_{Q,z}, K^{(2)}_{Q,z}(−1/α) > 0, then the function L_z in (3) is separable. The second part of the proof consists in showing that if the function L_z is separable, then, for all α ∈ K_{Q,z}, K^{(2)}_{Q,z}(−1/α) > 0. The first part is as follows. From Lemma 17, it holds that, for all α ∈ K_{Q,z}, where the sets R_0(α), R_1(α), and R_2(α) are respectively defined in (382). Hence, Under the assumption that, for all α ∈ K_{Q,z}, the function K^{(2)}_{Q,z} in (95) satisfies K^{(2)}_{Q,z}(−1/α) > 0, it follows that at least one of the following claims is true: Moreover, it holds that, for all (ν_1, ν_2) ∈ R_1(α) × R_2(α), where L_z(ν_1) < +∞ follows from the fact that P^{(Q,α)}_{Θ|Z=z}({θ ∈ M : L_z(θ) = +∞}) = 0 (Lemma 2). This proves that, under the assumption that for all α ∈ K_{Q,z}, K^{(2)}_{Q,z}(−1/α) > 0, the function L_z in (3) is separable with respect to P^{(Q,α)}_{Θ|Z=z}. From Lemma 14, it holds that the function L_z is separable with respect to Q. This completes the first part of the proof.
The second part of the proof is simpler. Assume that the empirical risk function L_z in (3) is separable with respect to P^{(Q,α)}_{Θ|Z=z}. That is, for all γ ∈ K_{Q,z}, there exist a positive real c_γ > 0 and two subsets A(γ) and B(γ) of M that are nonnegligible with respect to P^{(Q,γ)}_{Θ|Z=z} in (25) and verify that, for all (ν_1, ν_2) ∈ A(γ) × B(γ), From Lemma 17, it holds that where the inequality in (409) follows from the following facts. First, if c_γ < K^{(1)}_{Q,z}(−1/γ), then the corresponding term is strictly positive.
Second, if c_γ ⩾ K^{(1)}_{Q,z}(−1/γ), then the corresponding term is strictly positive (415). Finally, the strict inclusion N_{Q,z}(λ_2) ⊃ N⋆_{Q,z} is proved by contradiction. Assume that there exists a λ ∈ K_{Q,z} such that N⋆_{Q,z} = N_{Q,z}(λ). That is, Hence, three cases might arise: (a) there exists a λ ∈ K_{Q,z} such that δ⋆_{Q,z} < K^{(1)}_{Q,z}(−1/λ) and it holds that (b) there exists a λ ∈ K_{Q,z} such that δ⋆_{Q,z} > K^{(1)}_{Q,z}(−1/λ) and it holds that or (c) there exists a λ ∈ K_{Q,z} such that δ⋆_{Q,z} = K^{(1)}_{Q,z}(−1/λ). Cases (a) and (b) lead to contradictions. Hence, it suffices to consider case (c). In case (c), it holds that, and from the definition of δ⋆_{Q,z} in (37), it holds that From Lemma 26 and (438), it follows that, Finally, evaluating the sum P^{(Q,λ)}_{Θ|Z=z}(R_0(λ)) + P^{(Q,λ)}_{Θ|Z=z}(R_1(λ)) + P^{(Q,λ)}_{Θ|Z=z}(R_2(λ)) reveals a contradiction to the assumption that the function L_z is separable with respect to P^{(Q,λ)}_{Θ|Z=z} (and thus, by Lemma 14, separable with respect to Q). This completes the proof of (171).
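The monotonicity of the expected empirical risk in λ and the strict positivity of the variance term K^{(2)}_{Q,z}(−1/λ) for a non-constant (separable) L_z, discussed in the appendices above, can be illustrated on a finite model set; the values below are assumptions.

```python
import numpy as np

def gibbs(L, q, lam):
    w = q * np.exp(-L / lam)
    return w / w.sum()

def expected_risk(L, q, lam):
    # R_z(P^{(Q,λ)}): expected empirical risk under the Gibbs measure.
    return float(gibbs(L, q, lam) @ L)

def risk_variance(L, q, lam):
    # Finite-support analogue of K^{(2)}_{Q,z}(-1/λ): the variance of
    # L_z under the Gibbs measure; positive iff L_z is non-constant.
    p = gibbs(L, q, lam)
    m = p @ L
    return float(p @ (L - m) ** 2)

L = np.array([0.1, 0.4, 1.2, 2.0])       # non-constant ("separable") risks
q = np.array([0.25, 0.25, 0.25, 0.25])
risks = [expected_risk(L, q, lam) for lam in (0.1, 0.5, 1.0, 5.0)]
```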

APPENDIX R PROOF OF THEOREM 10
The proof of (172) is based on the analysis of the derivative of P^{(Q,λ)}_{Θ|Z=z}(A) with respect to λ. Note that, for all α ∈ N_{Q,z}(λ_2), it holds that, for all γ ∈ (λ_2, λ_1),
Lemma 4: Let P^{(Q,α)}_{Θ|Z=z} and P^{(Q,β)}_{Θ|Z=z} satisfy (25) with λ = α and λ = β, respectively. Then, P^{(Q,α)}_{Θ|Z=z} and P^{(Q,β)}_{Θ|Z=z} are mutually absolutely continuous. Proof: The proof is presented in Appendix F.

Lemma 5 :
Let the measure Q in (19) be a probability measure. Then, for all θ ∈ supp Q, the Radon-Nikodym derivative dP^{(Q,λ)}_{Θ|Z=z}/dQ

P^{(Q,λ)}_{Θ|Z=z}(A) with respect to λ, for some fixed set A ∈ B(M). More specifically, given a γ ∈ K_{Q,z}, it holds that From the fundamental theorem of calculus [70, Theorem 6.21], it follows that, for all (λ_1, λ_2) ∈ K_{Q,z} × K_{Q,z} with λ_1 > λ_2, where the equality in (444) follows from (442); and the equality in (445) follows from Lemma 2 and the dominated convergence theorem [59, Theorem 1.6.9]. For all θ ∈ supp Q, the following holds, ∫ (L_z(α) − L_z(ν)) dP if and only if the function L_z in (3) is separable with respect to the measure Q.
Proof: The proof is presented in Appendix N. The expected empirical risk R_z(P^{(Q,λ)}_{Θ|Z=z}) in (104) has been shown to be nondecreasing with λ in [9, Appendix E.4] for the special case in which Q is a probability measure.
When models are sampled from the probability measure P^{(Q,λ)}_{Θ|Z=z} in (25), the empirical risk with respect to the dataset z is a sub-Gaussian random variable with sub-Gaussianity parameter β_{Q,z}. The following corollary of Theorem 8 highlights this observation. Corollary 6: If β_{Q,z} in (163) is finite, the random variable W in (99) is a sub-Gaussian random variable with sub-Gaussianity parameter β_{Q,z} in (163).
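The sub-Gaussianity in Corollary 6 can be sanity-checked numerically for a bounded loss: for W = L_z(Θ) with Θ sampled from the Gibbs measure, the centered log-moment-generating function is bounded by s²β²/2. The sketch below uses Hoeffding's lemma's parameter β = (max − min)/2 for a bounded W, an illustrative stand-in for the paper's β_{Q,z} in (163); all values are assumptions.

```python
import numpy as np

def log_mgf_centered(s, L, p):
    # log E[exp(s (W - E[W]))] for W = L_z(Θ) with Θ ~ p.
    m = float(p @ L)
    return float(np.log(p @ np.exp(s * (L - m))))

L = np.array([0.0, 0.5, 1.0])        # bounded empirical risks
q = np.array([1 / 3, 1 / 3, 1 / 3])
lam = 0.4
p = q * np.exp(-L / lam)
p = p / p.sum()

# Hoeffding's lemma gives a valid sub-Gaussian parameter for bounded W;
# the paper's β_{Q,z} in (163) is a possibly tighter, λ-dependent choice.
beta = (L.max() - L.min()) / 2
```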