Towards a taxonomy of standards in smart data

Lenk, Alexander; Bonorden, Leif; Hellmanns, Astrid; Roedder, Nico; Jaehnichen, Stefan

doi:10.1109/bigdata.2015.7363946

Cited by 19 publications

(21 citation statements)

References 4 publications

“…The problem of noise is a crucial step in transforming Big Data into Smart Data, especially in Big Data scenarios. With this proposal, we have enabled the practitioner to reach Smart Data from raw and low‐quality Big Data. Our noise filter is able to deal with Big Data problems in a short time, achieving a noise clean version of the dataset.…”

Section: Discussion (mentioning)

confidence: 99%

From Big to Smart Data: Iterative ensemble filter for noise filtering in Big Data classification

et al. 2019

The quality of the data is directly related to the quality of the models drawn from that data. For that reason, much research is devoted to improving the quality of the data and amending the errors it may contain. One of the most common problems is the presence of noise in classification tasks, where noise refers to the incorrect labeling of training instances. This problem is very disruptive, as it changes the decision boundaries of the problem. Big Data problems pose a new challenge in terms of data quality due to the massive and unsupervised accumulation of data. This Big Data scenario also brings new problems to classic data preprocessing algorithms, as they are not prepared for working with such amounts of data, and these algorithms are key to moving from Big to Smart Data. In this paper, an iterative ensemble filter for removing noisy instances in Big Data scenarios is proposed. Experiments carried out on six Big Data datasets have shown that our noise filter outperforms the current state‐of‐the‐art noise filter in Big Data domains. It has also proved to be an effective solution for transforming raw Big Data into Smart Data.
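The idea of an iterative ensemble filter can be illustrated with a small single-machine sketch. This is not the method from the paper, whose details the abstract does not give; it assumes a simple scheme in which three toy classifiers (1-NN, 3-NN and nearest centroid, all chosen here only for illustration) vote on each held-out instance, instances rejected by a strict majority are dropped, and the pass is repeated until nothing more is removed. All function names and the toy dataset are hypothetical.

```python
from collections import Counter
import math

def knn_predict(train, x, k):
    """Majority label among the k nearest training points (Euclidean distance)."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def centroid_predict(train, x):
    """Label of the closest class centroid."""
    sums, counts = {}, Counter()
    for features, label in train:
        acc = sums.setdefault(label, [0.0] * len(features))
        for j, value in enumerate(features):
            acc[j] += value
        counts[label] += 1
    centroids = {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}
    return min(centroids, key=lambda lab: math.dist(centroids[lab], x))

# Hypothetical ensemble members; any set of classifiers could be used instead.
CLASSIFIERS = [
    lambda train, x: knn_predict(train, x, 1),
    lambda train, x: knn_predict(train, x, 3),
    centroid_predict,
]

def filter_once(data, folds=4):
    """One pass: drop instances that a strict majority of classifiers mislabel."""
    kept = []
    for i, (x, y) in enumerate(data):
        # Train on every fold except the one that holds instance i.
        train = [item for j, item in enumerate(data) if j % folds != i % folds]
        wrong = sum(clf(train, x) != y for clf in CLASSIFIERS)
        if wrong * 2 <= len(CLASSIFIERS):
            kept.append((x, y))
    return kept

def iterative_filter(data, max_rounds=5):
    """Repeat the filtering pass until no instance is removed."""
    for _ in range(max_rounds):
        filtered = filter_once(data)
        if len(filtered) == len(data):
            break
        data = filtered
    return data

# Toy dataset: two well-separated clusters, one instance with a wrong label.
data = [
    ((0.0, 0.0), "a"), ((1.0, 0.0), "a"), ((0.0, 1.0), "a"),
    ((0.5, 0.5), "b"),  # mislabeled: lies inside cluster "a"
    ((1.0, 1.0), "a"), ((0.2, 0.8), "a"),
    ((10.0, 10.0), "b"), ((11.0, 10.0), "b"), ((10.0, 11.0), "b"),
    ((11.0, 11.0), "b"), ((10.5, 10.5), "b"), ((10.2, 10.8), "b"),
]
clean = iterative_filter(data)
```

On this toy data the first pass removes only the mislabeled point and the second pass removes nothing, so the loop stops. A Big Data version would distribute the folds and the voting across a cluster, which this sketch does not attempt.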

“…Once the Big Data has been analyzed, processed, interpreted and cleaned, it is possible to access it in a structured way. This transformation is the difference between "Big" and "Smart" Data [26].…”

Section: From Big Data To Smart Data (mentioning)

confidence: 99%

Enabling Smart Data: Noise filtering in Big Data classification

García-Gil, Luengo, García et al. 2019

Information Sciences

128

In any knowledge discovery process, the value of extracted knowledge is directly related to the quality of the data used. Big Data problems, generated by the massive growth in the scale of data observed in recent years, follow the same dictate. A common problem affecting data quality is the presence of noise, particularly in classification problems, where label noise refers to the incorrect labeling of training instances and is known to be a very disruptive feature of data. However, in this Big Data era, the massive growth in the scale of the data poses a challenge to traditional proposals created to tackle noise, as they have difficulties coping with such a large amount of data. New algorithms need to be proposed to treat the noise in Big Data problems, providing high-quality, clean data, also known as Smart Data. In this paper, two Big Data preprocessing approaches to remove noisy examples are proposed: a homogeneous ensemble filter and a heterogeneous ensemble filter, with special emphasis on their scalability and performance traits. The obtained results show that these proposals enable the practitioner to efficiently obtain a Smart Dataset from any Big Data classification problem.

… accepted that we have entered the Big Data era [31]. Big Data is the set of technologies that make processing such large amounts of data possible [7], while most of the classic knowledge extraction methods cannot work in a Big Data environment because they were not conceived for it.

Big Data as a concept is defined around five aspects: data volume, data velocity, data variety, data veracity and data value [24]. While the volume, variety and velocity aspects refer to the data generation process and how to capture and store the data, the veracity and value aspects deal with the quality and the usefulness of the data. These last two aspects become crucial in any Big Data process, where the extraction of useful and valuable knowledge is strongly influenced by the quality of the data used.

In Big Data, the usage of traditional preprocessing techniques [16,34,18] to enhance the data is even more time-consuming and resource-demanding, making it unfeasible in most cases. The lack of efficient and affordable preprocessing techniques implies that the problems in the data will affect the models extracted. Among all the problems that may appear in the data, the presence of noise in the dataset is one of the most frequent. Noise can be defined as the partial or complete alteration of the information gathered for a data item, caused by an exogenous factor not related to the distribution that generates the data. Learning from noisy data is an important topic in machine learning, data mining and pattern recognition, as real-world data sets may suffer from imperfections in data acquisition, transmission, storage, integration and categorization. Noise will lead to excessively complex models with deteriorated performance [49], resulting in even larger computing times for less value.

The impact of noise in Big Data, among other pernicious traits, has not been disrega...

“…Referring to the well‐known “garbage in, garbage out” principle, accumulating vast amounts of raw data will not guarantee quality results, but poor knowledge. Smart data refers to the development of tools capable of dealing with massive and unstructured data to reveal its value (Lenk et al.). Once Smart Data are obtained, real time interactions with other business intelligence or transactional applications are affordable, evolving from data‐centered to learning organizations, where knowledge is the core instead of data management (Iafrate).…”

Section: Smart Data: Focusing On Value In Big Data (mentioning)

confidence: 99%

Transforming big data into smart data: An insight on the use of the k‐nearest neighbors algorithm to obtain quality data

Triguero, García-Gil, Maillo et al. 2018

WIREs Data Min & Knowl

126

The k‐nearest neighbors algorithm is characterized as a simple yet effective data mining technique. The main drawback of this technique appears when massive amounts of data (likely to contain noise and imperfections) are involved, turning this algorithm into an imprecise and especially inefficient technique. These disadvantages have been the subject of research for many years, and among other approaches, data preprocessing techniques such as instance reduction or missing values imputation have targeted these weaknesses. As a result, these issues have turned into strengths, and the k‐nearest neighbors rule has become a core algorithm to identify and correct imperfect data, removing noisy and redundant samples or imputing missing values, transforming Big Data into Smart Data, which is data of sufficient quality to expect a good outcome from any data mining algorithm. The role of this smart data gleaning algorithm in a supervised learning context is investigated. This includes a brief overview of Smart Data, current and future trends for the k‐nearest neighbor algorithm in the Big Data context, and the existing data preprocessing techniques based on this algorithm. We present the emerging big data‐ready versions of these algorithms and develop some new methods to cope with Big Data. We carry out a thorough experimental analysis in a series of big datasets that provide guidelines as to how to use the k‐nearest neighbor algorithm to obtain Smart/Quality Data for a high‐quality data mining process. Moreover, multiple Spark Packages have been developed including all the Smart Data algorithms analyzed. This article is categorized under: Technologies > Data Preprocessing; Fundamental Concepts of Data and Knowledge > Big Data Mining; Technologies > Classification
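As a rough illustration of how the k‐nearest neighbors rule can remove noisy samples, the sketch below implements the classical edited nearest neighbor idea: an instance is kept only if its label agrees with the majority label of its k nearest neighbors. It is a plain single-machine example and not the Spark packages mentioned in the abstract; the function names and the toy dataset are assumptions made for illustration.

```python
from collections import Counter
import math

def knn_vote(train, query, k):
    """Majority label among the k nearest training points (Euclidean distance)."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def edited_nearest_neighbors(data, k=3):
    """Keep only instances whose label matches the vote of their k neighbors."""
    kept = []
    for i, (features, label) in enumerate(data):
        others = data[:i] + data[i + 1:]  # leave the instance itself out
        if knn_vote(others, features, k) == label:
            kept.append((features, label))
    return kept

# Toy dataset: the fourth point sits inside cluster "a" but carries label "b".
data = [
    ((0.0, 0.0), "a"), ((0.1, 0.2), "a"), ((0.2, 0.1), "a"),
    ((0.15, 0.15), "b"),
    ((5.0, 5.0), "b"), ((5.1, 5.2), "b"), ((5.2, 5.1), "b"), ((4.9, 5.0), "b"),
]
clean = edited_nearest_neighbors(data, k=3)
```

With k=3 the mislabeled point is outvoted by its three "a" neighbors and dropped, while every other instance is kept. The brute-force neighbor search here is quadratic in the number of instances, which is exactly the cost that makes Big Data versions of the algorithm necessary.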