From Big to Smart Data: Iterative ensemble filter for noise filtering in Big Data classification

García-Gil, Diego; Sánchez, Francisco Luque; Luengo, Julián; García, Salvador; Herrera, Francisco

doi:10.1002/int.22193

Cited by 13 publications

(3 citation statements)

References 26 publications

“…Consequently, they proposed an ensemble filter based on C4.5, 1-NN, and linear machine. Similarly, some researchers [11], [12] proposed iterative partition filters that iteratively removed detected noise. In these methods, N filters were learned based on N groups of N − 1 partitions and voted to detect noise on the entire dataset.…”

Section: Related Work a Data Cleaning Methods (mentioning)

confidence: 99%

LR-BCA: Label Ranking for Bridge Condition Assessment

2021
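The partition-and-vote scheme described in the statement above can be sketched roughly as follows. This is a minimal illustration, not the method of the cited papers: the nearest-centroid base learner, the majority-vote threshold, the stopping rule, and all function names (`train_centroids`, `partition_filter_round`, `iterative_partition_filter`) are assumptions made for the example.

```python
from collections import Counter


def train_centroids(X, y):
    """Hypothetical stand-in base learner: one mean vector per class."""
    counts = Counter(y)
    sums = {}
    for xi, yi in zip(X, y):
        acc = sums.setdefault(yi, [0.0] * len(xi))
        for d, v in enumerate(xi):
            acc[d] += v
    return {c: [s / counts[c] for s in acc] for c, acc in sums.items()}


def predict(centroids, x):
    """Return the label of the closest class centroid."""
    def sqdist(label):
        return sum((a - b) ** 2 for a, b in zip(centroids[label], x))
    return min(sorted(centroids), key=sqdist)


def partition_filter_round(X, y, n_parts=3):
    """One round: train n_parts filters, each on all partitions but one,
    let every filter vote on the entire dataset, and flag the instances
    that a majority of filters misclassify (assumed voting rule)."""
    idx = list(range(len(X)))
    parts = [idx[i::n_parts] for i in range(n_parts)]
    wrong_votes = [0] * len(X)
    for held_out in range(n_parts):
        train = [i for p, part in enumerate(parts) if p != held_out for i in part]
        model = train_centroids([X[i] for i in train], [y[i] for i in train])
        for i in idx:
            if predict(model, X[i]) != y[i]:
                wrong_votes[i] += 1
    return [i for i in idx if 2 * wrong_votes[i] > n_parts]


def iterative_partition_filter(X, y, n_parts=3, max_rounds=10):
    """Repeat the round, removing flagged instances, until none are flagged
    or max_rounds is reached (assumed stopping rule)."""
    X, y = list(X), list(y)
    for _ in range(max_rounds):
        noisy = set(partition_filter_round(X, y, n_parts))
        if not noisy:
            break
        X = [x for i, x in enumerate(X) if i not in noisy]
        y = [label for i, label in enumerate(y) if i not in noisy]
    return X, y
```

Calling `iterative_partition_filter(X, y)` on a list of feature tuples and a parallel list of labels returns the filtered copies; a consensus vote (flag only when every filter disagrees with the label) would be a more conservative alternative to the majority rule used here.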

“…values [11] and reducing redundant [12] and noisy data [13] to obtain quality data from big datasets. In addition, there are contributions as proposed by Liu et al [14], where the results are improved and the runtime reduced in classification problems by selecting the appropriate classification rule according to a given neighborhood, instead of using the complete dataset.…”

Section: As a Key Technique Capable Of Imputing Missing (mentioning)

confidence: 99%

Redundancy and Complexity Metrics for Big Data Classification: Towards Smart Data

2020

Self Cite

The importance of knowing the descriptive properties of a dataset when tackling a data science problem is widely recognized. Having information about the redundancy, complexity and density of a problem allows us to decide which data preprocessing and machine learning techniques are most suitable. In classification problems, there are multiple metrics to describe the overlap of features between classes, class imbalance or separability, among others. However, these metrics may not scale well to big datasets, or may simply not be sufficiently informative in this context. In this paper, we provide a package of metrics for big data classification problems. In particular, we propose two new big data metrics: Neighborhood Density and Decision Tree Progression, which study density and accuracy progression by discarding half of the samples. In addition, we enable a number of basic metrics to handle big data. The experimental study carried out on standard big data classification problems shows that our metrics can quickly characterize big datasets. We identified a clear redundancy of information in most datasets, such that randomly discarding 75% of the samples does not drastically affect the accuracy of the classifiers used. Thus, the proposed big data metrics, which are available as a Spark-Package, provide a fast assessment of the shape of a classification dataset prior to applying big data preprocessing, toward smart data.

“…With the rapid development of Internet of Things (IoTs) technologies [2–5], the current explosive growth of data [6] makes the information overload more and more serious, which can be solved by recommender systems [7–10]. Generally, the recommender systems can be classified into three categories: collaborative filtering methods, content-based methods, and hybrid methods [11].…”

Section: Introduction (mentioning)

confidence: 99%