2019
DOI: 10.1371/journal.pone.0222916

Why Cohen’s Kappa should be avoided as performance measure in classification

Abstract: We show that Cohen’s Kappa and Matthews Correlation Coefficient (MCC), both extended and contrasted measures of performance in multi-class classification, are correlated in most situations, albeit can differ in others. Indeed, although in the symmetric case both match, we consider different unbalanced situations in which Kappa exhibits an undesired behaviour, i.e. a worse classifier gets higher Kappa score, differing qualitatively from that of MCC. The debate about the incoherence in the behaviour of Kappa rev…
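
The comparison the abstract describes can be checked numerically. Below is a minimal sketch (not the authors' code) that computes both Cohen’s Kappa and MCC for the same predictions using scikit-learn; the two imbalanced confusion matrices and the helper labels_from_confusion are hypothetical choices for illustration, not cases taken from the paper.

```python
# Minimal sketch: compute Cohen's kappa and MCC side by side with scikit-learn.
import numpy as np
from sklearn.metrics import cohen_kappa_score, matthews_corrcoef

def labels_from_confusion(cm):
    """Expand a confusion matrix (rows = true class, columns = predicted class)
    into explicit (y_true, y_pred) label vectors."""
    y_true, y_pred = [], []
    for i, row in enumerate(cm):
        for j, count in enumerate(row):
            y_true.extend([i] * count)
            y_pred.extend([j] * count)
    return np.array(y_true), np.array(y_pred)

# Hypothetical imbalanced confusion matrices, chosen only for illustration.
cm_a = [[90, 10],
        [ 5,  5]]
cm_b = [[80, 20],
        [ 2,  8]]

for name, cm in [("A", cm_a), ("B", cm_b)]:
    y_true, y_pred = labels_from_confusion(cm)
    print(f"{name}: kappa={cohen_kappa_score(y_true, y_pred):.3f}, "
          f"MCC={matthews_corrcoef(y_true, y_pred):.3f}")
```

For the specific unbalanced settings in which the two scores rank classifiers differently, see the full text at the DOI above.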

Cited by 230 publications (148 citation statements)
References 44 publications

“…The current most popular and widespread metrics include Cohen’s kappa [70–72]: originally developed to test inter-rater reliability, in the last decades Cohen’s kappa entered the machine learning community for comparing classifiers’ performances. Despite its popularity, in the learning context there are a number of issues causing the kappa measure to produce unreliable results (for instance, its high sensitivity to the distribution of the marginal totals [73–75]), stimulating research for more reliable alternatives [76]. Due to these issues, we chose not to include Cohen’s kappa in the present comparison study.…”
Section: Introduction (mentioning)
confidence: 99%
“…Calculated indices are all based on four [75]. Due to recent concerns about Cohen's Kappa Coefficient assessments and its undesired behavior discussed in Reference [76], the MCC metric was also calculated to ensure the validity of our evaluation. MCC values close to 1 represent perfect agreement, the value of 0 is interpreted as random prediction and the value of −1 is interpreted as complete opposite predictions for observations.…”
Section: Results (mentioning)
confidence: 99%
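
The three reference points mentioned in the statement above (−1, 0, +1) are easy to verify directly. The snippet below is an assumed illustration using scikit-learn's matthews_corrcoef, not code from any of the cited works; the synthetic labels exist only to exercise the three cases.

```python
# Minimal sketch: MCC at its three reference points.
import numpy as np
from sklearn.metrics import matthews_corrcoef

rng = np.random.default_rng(42)
y_true = rng.integers(0, 2, size=10_000)  # synthetic binary ground truth

print(matthews_corrcoef(y_true, y_true))                           # perfect agreement -> 1.0
print(matthews_corrcoef(y_true, rng.integers(0, 2, size=10_000)))  # independent guesses -> ~0.0
print(matthews_corrcoef(y_true, 1 - y_true))                       # complete opposite -> -1.0
```
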
“…Values close to −1 and +1 indicate performance much worse than chance, and much better than chance, respectively. As some concerns have been raised regarding the use of kappa as a performance measure in classification (Delgado and Tibau 2019), we also provide the Matthews Correlation Coefficient (MCC).…”
Section: Methods (mentioning)
confidence: 99%