2012
DOI: 10.1007/s00521-012-1056-5
Improving the precision-recall trade-off in undersampling-based binary text categorization using unanimity rule

Cited by 19 publications (12 citation statements) · References 44 publications
“…Such imbalance poses a challenge for categorization, especially when the classes have a high degree of overlap [31]. One possible solution to this problem is balancing the training set, i.e., re-sampling [5,10,39]. In a previous paper, we demonstrated that classifiers trained on balanced data perform better, on average, than classifiers trained on the original distribution of labels in the corpus [8].…”
Section: PULS Overview (mentioning)
confidence: 98%
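The balancing this excerpt refers to is straightforward to sketch. Below is a minimal, hypothetical example of random undersampling for a binary text-categorization training set, in plain Python; the function name, the variable names, and the 1:1 target ratio are illustrative assumptions, not details taken from the cited papers.

```python
import random
from collections import defaultdict

def undersample(docs, labels, seed=0):
    """Balance a binary training set by randomly dropping
    majority-class documents down to the minority-class size."""
    by_class = defaultdict(list)
    for doc, y in zip(docs, labels):
        by_class[y].append(doc)

    # Keep as many documents per class as the smallest class has.
    n_keep = min(len(v) for v in by_class.values())

    rng = random.Random(seed)
    balanced = []
    for y, class_docs in by_class.items():
        balanced.extend((doc, y) for doc in rng.sample(class_docs, n_keep))

    rng.shuffle(balanced)
    docs_out, labels_out = zip(*balanced)
    return list(docs_out), list(labels_out)
```

A classifier is then trained on the returned balanced set rather than on the original label distribution.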
“…The data-level approach is based on various re-sampling techniques [2]. Some re-sampling techniques applied to the text classification task are described in [6,4,18]. The two basic approaches to re-sampling are oversampling, i.e., adding more instances of the minority classes to the training set, and under-sampling, i.e., removing instances of the majority classes from the training set [11].…”
Section: Related Work (mentioning)
confidence: 99%
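To complement the undersampling sketch above, here is a hypothetical sketch of the other direction named in this excerpt, random oversampling: minority-class documents are sampled with replacement until every class matches the largest one. Names are again illustrative only.

```python
import random
from collections import defaultdict

def oversample(docs, labels, seed=0):
    """Balance a training set by duplicating documents of the
    smaller classes (sampling with replacement) until each class
    reaches the size of the largest one; originals are kept."""
    by_class = defaultdict(list)
    for doc, y in zip(docs, labels):
        by_class[y].append(doc)

    n_target = max(len(v) for v in by_class.values())

    rng = random.Random(seed)
    balanced = []
    for y, class_docs in by_class.items():
        extra = [rng.choice(class_docs)
                 for _ in range(n_target - len(class_docs))]
        balanced.extend((doc, y) for doc in class_docs + extra)

    rng.shuffle(balanced)
    docs_out, labels_out = zip(*balanced)
    return list(docs_out), list(labels_out)
```

The usual trade-off applies: oversampling can overfit the duplicated minority documents, while undersampling discards potentially useful majority data.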
“…For each classifier, the best threshold is trained on one random, originally-distributed development set; → and ∪ denote, respectively, two-stage and union combining methods, described in Section 6.

[Flattened results table reconstructed below. The column headers and the first row's classifier label were lost in extraction; each row gives six values, apparently P / R / F1 under two evaluation settings.]

[…]             …3±0.9    21.9±0.6  19.7±0.6  31.5±0.5  22.4±0.6  26.2±0.5
NB+BNS          34.2±1.1  16.6±0.6  15.8±0.5  33.1±0.7  13.4±0.4  19.0±0.5
SVM+IG          31.9±1.3  59.2±1.1  37.1±1.2  30.5±0.4  72.7±0.6  42.9±0.4
SVM+BNS         32.7±0.9  55.2±1.0  36.2±0.7  30.1±0.5  70.8±0.6  42.2±0.5
Rote            35.0±0.8  67.6±1.0  43.8±0.8  42.4±0.6  64.2±0.4  51.1±0.5
Rote→NB+BNS     51.5±0.9  33.6±0.4  36.1±0.4  57.6±0.6  39.1±0.4  46.6±0.4
NB+BNS→Rote     49.7±1.0  24.0±0.2  26.9±0.3  53.3±0.4  23.7±0.3  32.8±0.3
Rote ∪ NB+BNS   59.2±0.9  25.4±0.3  30.7±0.3  64.3±0.5  26.2±0.3  37.2±0.3
Rote→NB+IG      51.8±0.9  39.8±0.6  41.5±0.6  59.1±0.5  47.3±0.4  52.5±0.4
NB+IG→Rote      48.7±1.0  31.5±0.5  33.4±0.4  53.0±0.5  36.3±0.3  43.1±0.3
Rote ∪ NB+IG    57.2±0.9  32.7±0.4  37.3±0.4  63.2±0.5  38.1±0.3  47.5±0.4
Rote→SVM+BNS    48.2±1.0  67.5±1.0  54.7±0.9  53.7±0.5  70.1±0.3  60.8±0.4
SVM+BNS→Rote    48.0±1.1  63.0±1.0  52.6±1.0  50.2±0.4  70.8±0.4  58.7±0.4
Rote ∪ SVM+BNS  54.0±0.9  62.0±0.8  56.1±0.8  58.5±0.4  68.2±0.3  63.0±0.3
Rote→SVM+IG     46.2±1.0  73.7±0.8  55.1±0.8  52.5±0.5  75.9±0.4  62.0±0.4
SVM+IG→Rote     47.0±1.2  67.7±0.9  53.7±1.1  49.9±0.3  73.9±0.3  59.6±0.3
Rote ∪ SVM+IG   52.2±1.1  66.3±0.8  56.9±0.9  57.7±0.4  71.1±0.3  63.7±0.4…”
(mentioning)
confidence: 97%
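The excerpt only names the two combining operators, so the sketch below is an assumption-laden illustration rather than the citing paper's actual Section 6 definitions: it reads two-stage (→) as a cascade in which the second classifier re-checks only documents the first accepts, and pairs it with the unanimity rule from the cited paper's title, under which a document is labeled positive only when every ensemble member agrees. Classifiers are modeled as plain predicate functions.

```python
from typing import Callable, Sequence

# A classifier here is any function mapping a document to 0 or 1,
# e.g. a thresholded model wrapped in a lambda.
Classifier = Callable[[str], int]

def unanimity(classifiers: Sequence[Classifier], doc: str) -> int:
    """Unanimity rule: label positive only if every classifier
    agrees. Demanding full agreement suppresses false positives,
    trading recall for precision."""
    return int(all(clf(doc) == 1 for clf in classifiers))

def two_stage(first: Classifier, second: Classifier, doc: str) -> int:
    """Hypothetical cascade: the second classifier re-checks only
    the documents the first one accepts (one plausible reading of
    the -> operator in the quoted table)."""
    return second(doc) if first(doc) == 1 else 0
```

With members trained on differently undersampled subsets, unanimity voting typically raises precision over any single member, which is the precision-recall trade-off the paper's title refers to.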
“…To evaluate the performance of the DL classification, the F1 score was used as an evaluation metric [108]; it was calculated by Equation (11). In Equation (11), Precision, also called user's accuracy, denotes the ratio of the number of correctly classified pixels to the number of pixels assigned to the category in the classification result; Recall, also called producer's accuracy, denotes the ratio of the number of correctly classified pixels to the actual number of pixels in the category [109].…”
Section: Accuracy Assessment (mentioning)
confidence: 99%
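Equation (11) itself is not reproduced in the excerpt, but the quantities it combines are standard; with TP, FP, and FN counted at the pixel level, the F1 score described above is presumably the usual harmonic mean:

```latex
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
```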