A loss function measures the discrepancy between the true values and their estimated fits for a given instance of data. In classification problems, a loss function is said to be proper if the minimizer of the expected loss is the true underlying probability. In this work we show that for binary classification, the divergence associated with smooth, proper and convex loss functions is bounded from above by the Kullback-Leibler (KL) divergence, up to a normalization constant. This implies that by minimizing the log-loss (which is associated with the KL divergence), we minimize an upper bound on any choice of loss from this set. This property suggests that the log-loss is universal, in the sense that it provides performance guarantees for a broad class of accuracy measures. Importantly, our notion of universality is not restricted to a specific problem, which allows us to apply our results to many applications, including predictive modeling, data clustering and sample complexity analysis. Further, we show that the KL divergence bounds from above any separable Bregman divergence that is convex in its second argument (again, up to a normalization constant). This result introduces a new set of divergence inequalities, similar to Pinsker's inequality, and extends well-known f-divergence inequality results.
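As a schematic reading of this claim (the notation here is ours and the precise regularity conditions are fixed later in the paper), for every smooth, proper and convex binary loss with associated divergence $d_{\ell}$ there exists a constant $C_{\ell} > 0$ such that
$$ d_{\ell}(p \,\|\, q) \;\le\; C_{\ell}\, D_{\mathrm{KL}}(p \,\|\, q) \qquad \text{for all } p, q \in (0,1), $$
where $D_{\mathrm{KL}}$ denotes the binary KL divergence; hence driving the log-loss down also drives down an upper bound on $d_{\ell}$.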
I. INTRODUCTION

One of the major purposes of statistical analysis is making forecasts for future events and providing suitable guarantees associated with them. For example, consider a weather forecaster that estimates the chance of rain on the following day. Its performance may be evaluated by multiple statistical measures. We may count the number of times it assessed the chance of rain as greater than t = 50% while it eventually did not rain (and vice versa). This corresponds to a 0-1 loss (as we formally define later). Alternatively, we may consider a variety of threshold values t, or completely different measures (quadratic loss, Bernoulli log-likelihood loss, Boosting loss [2], etc.). Choosing a "good" measure is a well-studied problem, mostly in the context of scoring rules in decision theory [3]-[6]. Assuming that the desired measure is known in advance, the weather forecaster may be designed accordingly, to minimize that measure. However, in practice, different tasks may require inferring different information from the provided estimates. In such a case, designing a forecaster with respect to one measure may result in poor performance when it is evaluated by another. For example, the minimizer of a 0-1 loss may incur an unbounded loss when measured with the Bernoulli log-likelihood loss. This means that, ideally, a forecaster should be designed according to a "universal" measure that is "suitable" for a variety of purposes and provides performance guarantees for different uses. This requirement is obviously challenging. In this work we address this problem, as we show that for binary classification, the Bernoulli log-likelihood loss (log-loss) is a "universal" choice which dominates any alternative "analytically convenient" (smooth, ...
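To make the forecaster example concrete, the following sketch (ours; the function names and numerical values are illustrative and not taken from the paper) evaluates the three losses mentioned above for a single binary outcome, and shows how a confident forecast that is wrong once incurs an arbitrarily large log-loss while the 0-1 and quadratic losses stay bounded by 1:

    import numpy as np

    # Illustrative sketch: three of the losses mentioned above, for a binary
    # outcome y in {0, 1} and a forecast probability q in (0, 1).

    def zero_one_loss(y, q, t=0.5):
        # 0-1 loss: 1 whenever the forecast thresholded at t disagrees with y.
        return float((q > t) != bool(y))

    def quadratic_loss(y, q):
        # Quadratic (Brier) loss.
        return (y - q) ** 2

    def log_loss(y, q):
        # Bernoulli log-likelihood loss; unbounded as q -> 0 while y = 1.
        return -np.log(q) if y else -np.log(1.0 - q)

    # The 0-1 and quadratic losses are bounded by 1, but a single confident
    # mistake (q near 0 while y = 1) makes the log-loss arbitrarily large.
    y = 1
    for q in (0.6, 0.99, 1e-6):
        print(q, zero_one_loss(y, q), quadratic_loss(y, q), log_loss(y, q))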