Oversampling techniques have been widely used to improve classification on imbalanced data. However, unlike structured data, the basic units of text are words or characters, so instances oversampled in numerical space can lose word-level similarity in semantic space. To solve this problem, some methods use text rewriting to generate artificial samples directly. Unfortunately, existing rewriting techniques usually destroy the grammatical structure and logic of the original text. In this article, we improve and constrain several existing text rewriting methods and propose an effective algorithm that mines feature words from each class of text to support rewriting. In addition, by computing the similarity between texts, the data of each class are divided into key and non-key data, and different rewriting processes are designed for each. Experimental results on four imbalanced text classification tasks show that our method outperforms previous text rewriting methods, improving the classification accuracy of the model by 1.7% to 2.9% and the AUC by 0.012 to 0.058. Ablation experiments further explore the effects of the individual variables and methods on the results.

KEYWORDS
Oversampling, Imbalanced data, Text rewriting, Mining feature words, Key data

1 INTRODUCTION

1.1 Background

Text classification is one of the most basic tasks in natural language processing (NLP). With pre-trained word vectors,1,2 attention mechanisms,3 and other techniques developed over the past decade, many novel NLP networks have raised classification accuracy to a higher level.4-6 However, most of the previous literature is based on the assumption that the number of samples in each category of the target data is balanced, and the high performance of a classifier usually depends on the size and quality of the training data. In contrast, data distributions in real-world scenarios7 tend to be skewed.
Because their features are not distinctive enough, samples from classes with few instances (called minority classes) can easily be misclassified into the class with the most data (called the majority class), which leads to the class imbalance problem in classification. Currently, many studies8,9 address the class imbalance problem. The most common strategy is to re-sample the original data, which aims to mitigate the effects of imbalance by changing the spatial distribution of the samples. This technique can be divided into oversampling and undersampling. On the one hand, random duplication,10 a common and simple oversampling method, is usually used to handle the minority samples. However, this operation does not add any new information, such as words, phrases, or sentences; it simply copies the original text at random.
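The random duplication described above can be sketched as follows. This is a minimal illustration for the binary case, not the paper's implementation; the function and argument names (`random_oversample`, `target_label`) are hypothetical.

```python
import random

def random_oversample(texts, labels, target_label, seed=0):
    """Randomly duplicate minority-class texts until the two classes
    are balanced. Illustrative sketch: no new words, phrases, or
    sentences are added -- existing samples are simply copied."""
    rng = random.Random(seed)
    minority = [t for t, y in zip(texts, labels) if y == target_label]
    majority_count = sum(1 for y in labels if y != target_label)
    deficit = majority_count - len(minority)
    # Draw random copies of existing minority samples to fill the gap.
    extra = [rng.choice(minority) for _ in range(deficit)]
    return texts + extra, labels + [target_label] * deficit

# Toy data: 1 minority sample (label 1) vs. 3 majority samples (label 0).
X = ["bad service", "great", "fine", "okay"]
y = [1, 0, 0, 0]
X2, y2 = random_oversample(X, y, target_label=1)
# After oversampling both classes have 3 samples, but every added
# minority sample is an exact copy of "bad service".
```

Because the added samples are verbatim copies, the classifier sees no new semantic information, which is exactly the limitation that motivates text rewriting.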