Multiset Feature Learning for Highly Imbalanced Data Classification

Jing, Xiao‐Yuan; Zhang, Xinyu; Zhu, Xiaoke; Wu, Fei; You, Xinge; Gao, Yang; Shan, Shiguang; Yang, Jingyu

doi:10.1109/tpami.2019.2929166

Cited by 98 publications

(47 citation statements)

References 59 publications

“…The experiments in the first block were performed on artificial data sets taken from the paper by Napierala et al (2010) because using synthetic data allows us to know their characteristics a priori and analyze the effects of resampling in a fully controlled environment. The second group of experiments was on a well-known benchmark suite of real-life databases widely used for class imbalance problems (Chen et al, 2019; Jing et al, 2019; Kovács, 2019; Kuncheva et al, 2019; Lopez-Garcia et al, 2019), which are all available at the KEEL database repository (Alcalá-Fdez et al, 2011). The results of both experiments were estimated by 5-fold stratified cross-validation in order to have a sufficient amount of positive examples in the test partitions.…”

Section: Methods (mentioning)

confidence: 99%

Understanding the apparent superiority of over-sampling through an analysis of local information for class-imbalanced data

García, Sánchez, Marqués, et al. 2020

Expert Systems with Applications

Data plays a key role in the design of expert and intelligent systems and therefore, data preprocessing appears to be a critical step to produce high-quality data and build accurate machine learning models. Over the past decades, increasing attention has been paid towards the issue of class imbalance and this is now a research hotspot in a variety of fields. Although the resampling methods, either by undersampling the majority class or by over-sampling the minority class, stand among the most powerful techniques to face this problem, their strengths and weaknesses have typically been discussed based only on the class imbalance ratio. However, several questions remain open and need further exploration. For instance, the subtle differences in performance between the over- and under-sampling algorithms are still under-comprehended, and we hypothesize that they could be better explained by analyzing the inner structure of the data sets. Consequently, this paper attempts to investigate and illustrate the effects of the resampling methods on the inner structure of a data set by exploiting local neighborhood information, identifying the sample types in both classes and analyzing their distribution in each resampled set. Experimental results indicate that the resampling methods that pro
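The citation statement for this entry reports results estimated by 5-fold stratified cross-validation, chosen so that each test partition holds enough positive examples. The following is a minimal sketch of that splitting idea in plain Python, not the cited authors' code. The function name `stratified_k_fold`, the toy labels, and the 9:1 imbalance ratio are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k=5, seed=0):
    """Split sample indices into k folds that preserve class proportions.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin so every fold receives its share.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


# Assumed toy data: 90 negatives (0) and 10 positives (1).
labels = [0] * 90 + [1] * 10
folds = stratified_k_fold(labels, k=5)

# Count the positive (minority) examples that land in each fold.
positives_per_fold = [sum(labels[i] for i in fold) for fold in folds]
print(positives_per_fold)
```

With a plain random split, a fold could end up with very few or no minority examples; dealing each class separately guarantees that the 10 positives are spread evenly over the 5 folds.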

“…As such, advancing the development of algorithms and approaches for improved identification of rare classes is a key challenge for deep learning-based taxonomic identification (25). Solutions to this challenge could be inspired by class resampling and cost-sensitive training (91) or by multiset feature learning (92).…”

Section: Potential Deep Learning Applications In Entomology (mentioning)

confidence: 99%

Deep learning and computer vision will transform entomology

Høye, Ärje, Bjerge, et al. 2021

Proc. Natl. Acad. Sci. U.S.A.

Most animal species on Earth are insects, and recent reports suggest that their abundance is in drastic decline. Although these reports come from a wide range of insect taxa and regions, the evidence to assess the extent of the phenomenon is sparse. Insect populations are challenging to study, and most monitoring methods are labor intensive and inefficient. Advances in computer vision and deep learning provide potential new solutions to this global challenge. Cameras and other sensors can effectively, continuously, and noninvasively perform entomological observations throughout diurnal and seasonal cycles. The physical appearance of specimens can also be captured by automated imaging in the laboratory. When trained on these data, deep learning models can provide estimates of insect abundance, biomass, and diversity. Further, deep learning models can quantify variation in phenotypic traits, behavior, and interactions. Here, we connect recent developments in deep learning and computer vision to the urgent demand for more cost-efficient monitoring of insects and other invertebrates. We present examples of sensor-based monitoring of insects. We show how deep learning tools can be applied to exceptionally large datasets to derive ecological information and discuss the challenges that lie ahead for the implementation of such solutions in entomology. We identify four focal areas, which will facilitate this transformation: 1) validation of image-based taxonomic identification; 2) generation of sufficient training data; 3) development of public, curated reference databases; and 4) solutions to integrate deep learning and molecular tools.

“…Note that the dataset is common in both the criteria, giving us a total of 11 datasets. We choose these two categories because they are of special interest in research related to imbalanced datasets and have received extensive attention in this research area (Anand et al 2010; Hooda et al 2018; Jing et al 2019; Blagus and Lusa 2013).…”

Section: Datasets Used For Validation (mentioning)

confidence: 99%

LoRAS: an oversampling approach for imbalanced datasets

et al. 2020

The Synthetic Minority Oversampling TEchnique (SMOTE) is widely used for the analysis of imbalanced datasets. It is known that SMOTE frequently over-generalizes the minority class, leading to misclassifications for the majority class, and affecting the overall balance of the model. In this article, we present an approach that overcomes this limitation of SMOTE, employing Localized Random Affine Shadowsampling (LoRAS) to oversample from an approximated data manifold of the minority class. We benchmarked our algorithm with 14 publicly available imbalanced datasets using three different Machine Learning (ML) algorithms and compared the performance of LoRAS, SMOTE and several SMOTE extensions that share the concept of using convex combinations of minority class data points for oversampling with LoRAS. We observed that LoRAS, on average, generates better ML models in terms of F1-Score and Balanced accuracy. Another key observation is that while most of the extensions of SMOTE we have tested improve the F1-Score with respect to SMOTE on average, they compromise on the Balanced accuracy of a classification model. LoRAS, on the contrary, improves both the F1-Score and the Balanced accuracy and thus produces better classification models. Moreover, to explain the success of the algorithm, we have constructed a mathematical framework to prove that the LoRAS oversampling technique provides a better estimate for the mean of the underlying local data distribution of the minority class data space.
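The abstract above compares methods by F1-Score and Balanced accuracy. As a rough illustration of why these two metrics are preferred over plain accuracy on imbalanced data, here is a minimal sketch using their standard binary definitions. The helper names and the toy predictions are assumptions for illustration and do not come from the LoRAS paper.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn) for a binary problem."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def f1_score(y_true, y_pred, positive=1):
    """F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall."""
    tp, fp, _, fn = confusion_counts(y_true, y_pred, positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def balanced_accuracy(y_true, y_pred, positive=1):
    """Mean of the true positive rate and the true negative rate."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    tnr = tn / (tn + fp) if tn + fp else 0.0
    return (tpr + tnr) / 2


# Assumed toy data: 10 positives followed by 90 negatives.
y_true = [1] * 10 + [0] * 90

# A trivial classifier that always predicts the majority class.
always_negative = [0] * 100

# A classifier that finds 8 of 10 positives at the cost of 9 false alarms.
mixed = [1] * 8 + [0] * 2 + [1] * 9 + [0] * 81

print(f1_score(y_true, always_negative), balanced_accuracy(y_true, always_negative))
print(f1_score(y_true, mixed), balanced_accuracy(y_true, mixed))
```

The always-negative classifier reaches 90% plain accuracy on this toy data, yet its F1-Score is 0 and its Balanced accuracy is 0.5, which is why both metrics are commonly reported for imbalanced benchmarks.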
