A loss function measures the discrepancy between the true values and their estimated fits for a given instance of data. In classification problems, a loss function is said to be proper if the minimizer of the expected loss is the true underlying probability. In this work we show that for binary classification, the divergence associated with smooth, proper and convex loss functions is bounded from above by the Kullback-Leibler (KL) divergence, up to a normalization constant. This implies that by minimizing the log-loss (which is associated with the KL divergence), we minimize an upper bound on any choice of loss from this set. This property suggests that the log-loss is universal, in the sense that it provides performance guarantees for a broad class of accuracy measures. Importantly, our notion of universality is not restricted to a specific problem, which allows us to apply our results to many applications, including predictive modeling, data clustering and sample complexity analysis. Further, we show that the KL divergence bounds from above any separable Bregman divergence that is convex in its second argument (again, up to a normalization constant). This result introduces a new set of divergence inequalities, similar to Pinsker's inequality, and extends well-known f-divergence inequality results.
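As a schematic reading of this claim (the notation here is ours and the precise regularity conditions are fixed later in the paper), for every smooth, proper and convex binary loss with associated divergence $d_{\ell}$ there exists a constant $C_{\ell} > 0$ such that
$$ d_{\ell}(p \,\|\, q) \;\le\; C_{\ell}\, D_{\mathrm{KL}}(p \,\|\, q) \qquad \text{for all } p, q \in (0,1), $$
where $D_{\mathrm{KL}}$ denotes the binary KL divergence; hence driving the log-loss down also drives down an upper bound on $d_{\ell}$.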
I. INTRODUCTION

One of the major purposes of statistical analysis is making forecasts for future events and providing suitable guarantees associated with them. For example, consider a weather forecaster that estimates the chance of rain on the following day. Its performance may be evaluated by multiple statistical measures. We may count the number of times it assessed the chance of rain as greater than t = 50% while it eventually did not rain (and vice versa). This corresponds to a 0-1 loss (as we formally define later). Alternatively, we may consider a variety of threshold values t, or completely different measures (quadratic loss, Bernoulli log-likelihood loss, Boosting loss [2], etc.). Choosing a "good" measure is a well-studied problem, mostly in the context of scoring rules in decision theory [3]-[6]. Assuming that the desired measure is known in advance, the weather forecaster may be designed accordingly, to minimize that measure. However, in practice, different tasks may require inferring different information from the provided estimates. In such a case, designing a forecaster with respect to one measure may result in poor performance when it is evaluated by another. For example, the minimizer of a 0-1 loss may incur an unbounded loss when measured with the Bernoulli log-likelihood loss. This means that, ideally, a forecaster should be designed according to a "universal" measure that is "suitable" for a variety of purposes and provides performance guarantees for different uses. This requirement is obviously challenging. In this work we address this problem, as we show that for binary classification, the Bernoulli log-likelihood loss (log-loss) is a "universal" choice which dominates any alternative "analytically convenient" (smooth, ...
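To make the forecaster example concrete, the following sketch (ours; the function names and numerical values are illustrative and not taken from the paper) evaluates the three losses mentioned above for a single binary outcome, and shows how a confident forecast that is wrong once incurs an arbitrarily large log-loss while the 0-1 and quadratic losses stay bounded by 1:

    import numpy as np

    # Illustrative sketch: three of the losses mentioned above, for a binary
    # outcome y in {0, 1} and a forecast probability q in (0, 1).

    def zero_one_loss(y, q, t=0.5):
        # 0-1 loss: 1 whenever the forecast thresholded at t disagrees with y.
        return float((q > t) != bool(y))

    def quadratic_loss(y, q):
        # Quadratic (Brier) loss.
        return (y - q) ** 2

    def log_loss(y, q):
        # Bernoulli log-likelihood loss; unbounded as q -> 0 while y = 1.
        return -np.log(q) if y else -np.log(1.0 - q)

    # The 0-1 and quadratic losses are bounded by 1, but a single confident
    # mistake (q near 0 while y = 1) makes the log-loss arbitrarily large.
    y = 1
    for q in (0.6, 0.99, 1e-6):
        print(q, zero_one_loss(y, q), quadratic_loss(y, q), log_loss(y, q))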