Evaluation measures for quantification: an axiomatic approach

Sebastiani, Fabrizio

doi:10.1007/s10791-019-09363-y

Cited by 40 publications

(26 citation statements)

References 56 publications

Supporting

Mentioning

Contrasting

Unclassified

Order By: Relevance

“…For example, the same difference, in absolute value, between the true and the predicted prevalence values may have a different "cost" depending on the original true prevalence value: predicting 0.5 prevalence when the true prevalence is 0.49 can be considered, in some application contexts, a less blatant error than predicting a prevalence of 0.01 when the true prevalence is 0.00. In some other application contexts, though, the two above-mentioned estimation errors may be considered equally serious [29]. This means that sometimes we may want to use a certain evaluation measure and some other times we may want to use a different one.…”

Section: Discussionmentioning

confidence: 99%

See 1 more Smart Citation

QuaPy: A Python-Based Framework for Quantification

Moreo

Esuli

Sebastiani

2021

Proceedings of the 30th ACM International Conference on Information &Amp; Knowledge Management

Self Cite

View full text Add to dashboard Cite

QuaPy is an open-source framework for performing quantification (a.k.a. supervised prevalence estimation), written in Python. Quantification is the task of training quantifiers via supervised learning, where a quantifier is a predictor that estimates the relative frequencies (a.k.a. prevalence values) of the classes of interest in a sample of unlabelled data. While quantification can be trivially performed by applying a standard classifier to each unlabelled data item and counting how many data items have been assigned to each class, it has been shown that this "classify and count" method is outperformed by methods specifically designed for quantification. QuaPy provides implementations of a number of baseline methods and advanced quantification methods, of routines for quantificationoriented model selection, of several broadly accepted evaluation measures, and of robust evaluation protocols routinely used in the field. QuaPy also makes available datasets commonly used for testing quantifiers, and offers visualization tools for facilitating the analysis and interpretation of the results. The software is opensource and publicly available under a BSD-3 licence via GitHub 1 , and can be installed via pip 2 . CCS CONCEPTS• Computing methodologies → Learning paradigms; • Information systems → Data mining.

show abstract

Section: Discussionmentioning

confidence: 99%

“…Several error measures have been proposed in the literature [29], and QuaPy implements a rich set of them:…”

Section: Error Measuresmentioning

confidence: 99%

QuaPy: A Python-Based Framework for Quantification

Moreo

Esuli

Sebastiani

2021

Proceedings of the 30th ACM International Conference on Information &Amp; Knowledge Management

Self Cite

View full text Add to dashboard Cite

show abstract

“…Even if not specifically focused on scales and their relationship to IR evaluation measures, there is a bulk of research on studying which constraints define the core properties of evaluation measures: Amigó et al [6,7,8,9] and Sebastiani [99] face this issue from a formal and theoretical point of view, applying it to various tasks such as ranking, filtering, diversity and quantification, while Moffat [77] adopts a more numerical approach.…”

Section: Related Workmentioning

confidence: 99%

Towards Meaningful Statements in IR Evaluation: Mapping Evaluation Measures to Interval Scales

Ferrante

Ferro²,

Fuhr

2021

IEEE Access

View full text Add to dashboard Cite

Information Retrieval (IR) is a discipline deeply rooted in evaluation since its inception. Indeed, experimentally measuring and statistically validating the performance of IR systems are the only possible ways to compare systems and understand which are better than others and, ultimately, more effective and useful for end-users. Since the seminal paper by Stevens [103], it is known that the properties of the measurement scales determine the operations you should or should not perform with values from those scales. For example, Stevens suggested that you can compute means and variances only when you are working with, at least, interval scales. It was recently shown that the most popular evaluation measures in IR are not intervalscaled. However, so far, there has been little or no investigation in IR on the impact and consequences of departing from scale assumptions. Taken to the extremes, it might even mean that decades of experimental IR research used potentially improper methods, which may have produced results needing further validation. However, it was unclear if and to what extent these findings apply to actual evaluations; this opened a debate in the community with researchers standing on opposite positions about whether this should be considered an issue (or not) and to what extent. In this paper, we first give an introduction to the representational measurement theory explaining why certain operations and significance tests are permissible only with scales of a certain level. For that, we introduce the notion of meaningfulness specifying the conditions under which the truth (or falsity) of a statement is invariant under permissible transformations of a scale. Furthermore, we show how the recall base and the length of the run may make comparison and aggregation across topics problematic. Then we propose a straightforward and powerful approach for turning an evaluation measure into an interval scale, and describe an experimental evaluation of the differences between the original measures and the intervalscaled ones. For all the regarded measures -namely Precision, Recall, Average Precision, (Normalized) Discounted Cumulative Gain, Rank-Biased Precision and Reciprocal Rank -we observe substantial effects, both on the order of average values and on the outcome of significance tests. For the latter, previously significant differences turn out to be insignificant, while insignificant ones become significant. The effect varies remarkably between the tests considered but on average, we observed a 25% change in the decision about which systems are significantly different and which are not. These experimental findings further support the idea that measurement scales matter and that departing from their assumptions has an impact. This not only suggests that, to the extent possible, it would be better to comply with such assumptions but it also urges us to clearly indicate when we depart from such assumptions and, carefully, point out the limitations of the conclusions we draw and under which conditions they are drawn.

show abstract

“…where p U andp U indicate the true class distribution and the predicted class distribution, resp., on the set U of unlabelled documents. The reason we use NAE is that, besides its simplicity, it is also (as argued in [35]) one of the theoretically most satisfying measures for evaluating the quality of class priors; NAE ranges between 0 (best) and 1 (worst). In all the tables of results that we include in Section 4, we compare the estimates of the class priors before applying SLD, computed by "classifying and counting", i.e., aŝ…”

Section: Evaluation Measuresmentioning

confidence: 99%

A Critical Reassessment of the Saerens-Latinne-Decaestecker Algorithm for Posterior Probability Adjustment

Esuli

Molinari

Sebastiani

2020

ACM Trans. Inf. Syst.

Self Cite

View full text Add to dashboard Cite

We critically re-examine the Saerens-Latinne-Decaestecker (SLD) algorithm, a well-known method for estimating class prior probabilities (“priors”) and adjusting posterior probabilities (“posteriors”) in scenarios characterized by distribution shift, i.e., difference in the distribution of the priors between the training and the unlabelled documents. Given a machine learned classifier and a set of unlabelled documents for which the classifier has returned posterior probabilities and estimates of the prior probabilities, SLD updates them both in an iterative, mutually recursive way, with the goal of making both more accurate; this is of key importance in downstream tasks such as single-label multiclass classification and cost-sensitive text classification. Since its publication, SLD has become the standard algorithm for improving the quality of the posteriors in the presence of distribution shift, and SLD is still considered a top contender when we need to estimate the priors (a task that has become known as “quantification”). However, its real effectiveness in improving the quality of the posteriors has been questioned. We here present the results of systematic experiments conducted on a large, publicly available dataset, across multiple amounts of distribution shift and multiple learners. Our experiments show that SLD improves the quality of the posterior probabilities and of the estimates of the prior probabilities, but only when the number of classes in the classification scheme is very small and the classifier is calibrated. As the number of classes grows, or as we use non-calibrated classifiers, SLD converges more slowly (and often does not converge at all), performance degrades rapidly, and the impact of SLD on the quality of the prior estimates and of the posteriors becomes negative rather than positive.

show abstract

Evaluation measures for quantification: an axiomatic approach

Cited by 40 publications

References 56 publications

QuaPy: A Python-Based Framework for Quantification

QuaPy: A Python-Based Framework for Quantification

Towards Meaningful Statements in IR Evaluation: Mapping Evaluation Measures to Interval Scales

A Critical Reassessment of the Saerens-Latinne-Decaestecker Algorithm for Posterior Probability Adjustment

Contact Info

Product

Resources

About