Do unbalanced data have a negative effect on LDA?

Xue, Jing-Hao; Titterington, D. M.

doi:10.1016/j.patcog.2007.11.008

Cited by 56 publications

(51 citation statements)

References 12 publications


“…However, a study by Xue and Titterington [21] revealed that there is no reliable empirical evidence to support the claim that an unbalanced data set negatively impacts the performance of the LDA/BTM approaches. Further, a recent study by López et al [22] shows that the unbalanced ratio by itself does not have the most significant effect on the classifiers performance, but there are other issues such as (a) the presence of small disjuncts, (b) the lack of density, (c) the class overlapping, (d) the noisy data, (e) the management of borderline examples, and (f) the dataset shift that must be taken into account.…”

Section: Limitations and Threats To Validity (mentioning)

confidence: 99%

What Works Better? A Study of Classifying Requirements

Abad, Karras, Ghazi, et al. 2017

2017 IEEE 25th International Requirements Engineering Conference (RE)


Classifying requirements into functional requirements (FR) and non-functional ones (NFR) is an important task in requirements engineering. However, automated classification of requirements written in natural language is not straightforward, due to the variability of natural language and the absence of a controlled vocabulary. This paper investigates how automated classification of requirements into FR and NFR can be improved and how well several machine learning approaches work in this context. We contribute an approach for preprocessing requirements that standardizes and normalizes requirements before applying classification algorithms. Further, we report on how well several existing machine learning methods perform for automated classification of NFRs into sub-categories such as usability, availability, or performance. Our study is performed on 625 requirements provided by the OpenScience tera-PROMISE repository. We found that our preprocessing improved the performance of an existing classification method. We further found significant differences in the performance of approaches such as Latent Dirichlet Allocation, Biterm Topic Modeling, or Naïve Bayes for the sub-classification of NFRs.


“…1 shows a motivating example, using a scatter plot and a panel of nine boxplots of AUC to illustrate visually the fact that rebalancing the training data can often improve the performance of LDA in terms of AUC [5], [6]. This example is extracted from an experiment on simulated data arising from two four-dimensional, Gaussian-distributed classes C 0 and C 1 .…”

Section: Notation (mentioning)

confidence: 99%

“…This example is extracted from an experiment on simulated data arising from two four-dimensional, Gaussian-distributed classes C 0 and C 1 . With a slightly different setting, the experiment explores more rebalancing scenarios than in [6]. It includes the following four steps.…”

Section: Notation (mentioning)

confidence: 99%

“…Using the rebalanced training data can often increase (i.e. improve) the area (AUC) under the receiver operating characteristic (ROC) curve for the original, unbalanced test data [4], [5], [6], [7]. The AUC is a widely-used quantitative measure of classification performance, but the empirical property that rebalancing increases AUC lacks theoretical justification.…”

Section: Introduction (mentioning)

confidence: 99%


Why Does Rebalancing Class-Unbalanced Data Improve AUC for Linear Discriminant Analysis?

Xue, Hall 2015

IEEE Trans. Pattern Anal. Mach. Intell.


Many established classifiers fail to identify the minority class when it is much smaller than the majority class. To tackle this problem, researchers often first rebalance the class sizes in the training dataset, through oversampling the minority class or undersampling the majority class, and then use the rebalanced data to train the classifiers. This leads to interesting empirical patterns. In particular, using the rebalanced training data can often improve the area under the receiver operating characteristic curve (AUC) for the original, unbalanced test data. The AUC is a widely-used quantitative measure of classification performance, but the property that it increases with rebalancing has, as yet, no theoretical explanation. In this note, using Gaussian-based linear discriminant analysis (LDA) as the classifier, we demonstrate that, at least for LDA, there is an intrinsic, positive relationship between the rebalancing of class sizes and the improvement of AUC. We show that the largest improvement of AUC is achieved, asymptotically, when the two classes are fully rebalanced to be of equal sizes.
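The workflow this abstract describes (rebalance the training set, fit Gaussian LDA, then measure AUC on the original, unbalanced test data) can be sketched as follows. This is a minimal illustration and not the authors' experiment: the class means, the covariances, the sample sizes, the use of random undersampling, and every function name are assumptions made for the example. Whether rebalancing raises the AUC in a given run depends on the data, so the sketch only prints both values.

```python
import numpy as np

def fit_lda_direction(X, y):
    """Gaussian LDA direction w = S^-1 (mu1 - mu0), S = pooled covariance.
    The intercept is omitted because it does not change the AUC."""
    X0, X1 = X[y == 0], X[y == 1]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    # Pooled covariance: both class scatter matrices summed, divided by n - 2.
    S = ((X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)) / (len(X) - 2)
    return np.linalg.solve(S, mu1 - mu0)

def auc(scores, y):
    """AUC as the Mann-Whitney statistic; ties count as one half."""
    pos, neg = scores[y == 1], scores[y == 0]
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (len(pos) * len(neg)))

def undersample_majority(X, y, rng):
    """Randomly drop rows so that both classes end up with equal size."""
    idx0, idx1 = np.flatnonzero(y == 0), np.flatnonzero(y == 1)
    n = min(len(idx0), len(idx1))
    keep = np.concatenate([rng.choice(idx0, n, replace=False),
                           rng.choice(idx1, n, replace=False)])
    return X[keep], y[keep]

def simulate(n0, n1, rng, dim=4):
    """Two four-dimensional Gaussian classes with unequal covariances.
    The parameter values are illustrative assumptions."""
    X0 = rng.normal(0.0, 1.0, size=(n0, dim))
    X1 = rng.normal(0.5, np.arange(1.0, dim + 1.0), size=(n1, dim))
    X = np.vstack([X0, X1])
    y = np.concatenate([np.zeros(n0, dtype=int), np.ones(n1, dtype=int)])
    return X, y

rng = np.random.default_rng(0)
X_train, y_train = simulate(2000, 100, rng)   # unbalanced training data
X_test, y_test = simulate(2000, 100, rng)     # test data keeps the imbalance

w_unbalanced = fit_lda_direction(X_train, y_train)
X_bal, y_bal = undersample_majority(X_train, y_train, rng)
w_balanced = fit_lda_direction(X_bal, y_bal)

auc_unbalanced = auc(X_test @ w_unbalanced, y_test)
auc_balanced = auc(X_test @ w_balanced, y_test)
print(f"AUC, trained on unbalanced data: {auc_unbalanced:.3f}")
print(f"AUC, trained on rebalanced data: {auc_balanced:.3f}")
```

The class covariances are deliberately unequal here. Rebalancing changes how much weight each class contributes to the pooled covariance estimate, and that reweighting can only alter the fitted direction when the two class covariances differ.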


“…Even though the LDA has been extensively studied [7][8][9], the effect of unbalanced training datasets using electroencephalographic (EEG) data and the number of patterns necessary to reach a performance plateau have not been tested. That is, the point at which no significant performance gain will exist when adding more training patterns has not been determined.…”

Section: Introduction (mentioning)

confidence: 99%

Determination of an optimal training strategy for a BCI classification task with LDA

Gareis, Acevedo, Atum, et al. 2011

2011 5th International IEEE/EMBS Conference on Neural Engineering


Brain computer interfaces (BCIs) translate brain activity into computer commands. To enhance the performance of a BCI, it is necessary to improve the feature extraction techniques being applied to decode the users' intentions. Objective comparison methods are needed to analyze different feature extraction techniques. One possibility is to use the classifier performance as a comparative measure. In this paper, we study the behavior of linear discriminant analysis (LDA) when used to distinguish between electroencephalographic (EEG) signals with and without the presence of event related potentials (ERPs).

