Statistical analysis of the performance of four Apache Spark ML algorithms

Camele, Genaro; Hasperué, Waldo; Ronchetti, Franco; Quiroga, Facundo

doi:10.24215/16666038.22.e14

Cited by 1 publication

(1 citation statement)

References 0 publications

“…Also, the dataset used in [50] was from the SEER repository. In [51], the high-dimensional dataset for cancer prediction is used.…”

Section: Types Of Datasets Used In Related Work (mentioning)

confidence: 99%

A Survey Study on Proposed Solutions for Imbalanced Big Data

Razoqi, Al-Talib

2024

Iraqi Journal of Science

Learning from imbalanced data has been studied continuously for more than two decades. Training data is considered imbalanced when the positive (minority) class is overshadowed by the much larger negative (majority) class, a problem compounded by deviating distributions in binary tasks. The emergence of big data brings new challenges to the imbalance problem, commonly summarized as the 5Vs: volume, velocity, veracity, value, and variety. This study divides solutions to the data imbalance problem into three levels: data level, algorithm level, and hybrid approaches. It first describes the standard solutions proposed for the problem and identifies the most important metrics used to measure classification performance on imbalanced data. The survey reviews 27 studies published during the period 2015–2022, grouped by the level at which they treat the imbalance problem. It also reviews the performance metrics used in these studies and the sources of the datasets to which the solutions were applied. The study makes it easier for researchers to find solutions to the data imbalance problem, including recently used hybrid approaches, and to apply them to improve classification.
