Achieving a high level of data quality is considered one of the most important assets for any small, medium, or large organization. Data quality is a central concern for both practitioners and researchers who deal with traditional or big data. The level of data quality is measured through several quality dimensions. A high percentage of current studies focus on assessing and applying data quality to traditional data. As we are in the era of big data, attention should be paid to the tremendous volume of generated and processed data, of which roughly 80% is unstructured. However, initiatives for creating big data quality evaluation models are still under development. This paper investigates the data quality dimensions that are most commonly used for both traditional and big data, in order to identify the metrics and techniques used to measure and handle each dimension. A complete definition of each traditional and big data quality dimension, along with its metrics and handling techniques, is presented in this paper. Many data quality dimensions can be applied to both traditional and big data, while a few quality dimensions apply only to one or the other. Only a small number of data quality metrics, and hardly any handling techniques, are presented in the current works.
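The abstract does not include code, but as an illustration of how individual quality dimensions can be scored in practice, the following Python sketch computes two widely used dimension metrics: completeness (fraction of non-missing cells) and uniqueness (fraction of non-duplicated records). The example table and column names are hypothetical, not taken from the paper.

```python
import pandas as pd

def completeness(df: pd.DataFrame) -> float:
    """Fraction of non-missing cells across the whole table."""
    return 1.0 - df.isna().sum().sum() / df.size

def uniqueness(df: pd.DataFrame) -> float:
    """Fraction of rows that are not exact duplicates of another row."""
    return len(df.drop_duplicates()) / len(df)

# Hypothetical example data: 'age' has one missing value and one row repeats.
df = pd.DataFrame({
    "age":  [25, None, 40, 25],
    "city": ["Cairo", "Giza", "Luxor", "Cairo"],
})
print(f"completeness = {completeness(df):.2f}")  # 7 of 8 cells -> 0.88
print(f"uniqueness   = {uniqueness(df):.2f}")    # 3 of 4 rows  -> 0.75
```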
Abstract—Data classification is one of the most important tasks in data mining; it identifies the category to which a new observation belongs on the basis of a training set. Preparing data before any mining is an essential step to ensure the quality of the mined data. Different algorithms are used to solve classification problems. In this research, four algorithms, namely support vector machine (SVM), C5.0, K-nearest neighbor (KNN), and Recursive Partitioning and Regression Trees (rpart), are compared before and after applying two feature selection techniques: Wrapper and Filter. This comparative study is implemented using the R programming language. A direct marketing campaigns dataset from a banking institution is used to predict whether a client will subscribe to a term deposit. The dataset is composed of 4521 instances: 3521 instances (78%) form the training set and 1000 instances (22%) form the testing set. The results show that C5.0 is superior to the other algorithms before applying the FS techniques, and SVM is superior to the others after applying FS.

Keywords—Classification, Feature Selection, Wrapper Technique, Filter Technique, Support Vector Machine (SVM), C5.0, K-Nearest Neighbor (KNN), Recursive Partitioning and Regression Trees (rpart).

I. INTRODUCTION

The problem of data classification has numerous applications in a wide variety of mining applications, because it attempts to learn the relationship between a set of feature variables and a target variable of interest. Excellent overviews of data classification may be found in the literature. Classification algorithms typically contain two phases: a training phase, in which a model is constructed from the training instances, and a testing phase, in which the model is used to assign a label to an unlabeled test instance [1]. Classification consists of predicting a certain outcome based on a given input. To predict the outcome, the algorithm processes a training set containing a set of attributes and the respective outcome, usually called the goal or prediction attribute, and tries to discover relationships between the attributes that make it possible to predict the outcome. The algorithm is then given a data set, called the prediction set, which contains the same set of attributes except for the prediction attribute, which is not yet known. The algorithm analyses the input and produces predicted instances; the prediction accuracy defines how "good" the algorithm is [2]. The four classifiers used in this paper are shown in Figure 1. However, many irrelevant, noisy, or ambiguous attributes may be present in the data to be mined, and they need to be removed because they degrade the performance of the algorithms. Attribute selection methods are used to avoid overfitting, improve model performance, and provide faster and more cost-effective models [3]. The main purpose of the Feature Selection (FS) approach is to select a minimal and relevant feature subset for a given dataset while maintaining its original representation. FS not only reduces...
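The study itself was implemented in R; the sketch below is only a rough Python/scikit-learn illustration of the same before/after comparison, applying a filter technique (mutual-information ranking) ahead of three classifiers. The file name "bank.csv", the ";" separator, the dummy-encoded target column "y_yes", and k=10 are assumptions, and DecisionTreeClassifier merely stands in for C5.0 and rpart, which have no scikit-learn equivalents.

```python
# Illustrative sketch (not the paper's R code): compare classifiers
# before and after a filter-style feature selection step.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier  # stand-in for C5.0 / rpart
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score

# Assumed layout of the UCI bank-marketing file; categoricals one-hot encoded.
df = pd.get_dummies(pd.read_csv("bank.csv", sep=";"), drop_first=True)
X, y = df.drop(columns=["y_yes"]), df["y_yes"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.22, random_state=0)

models = {
    "SVM":  SVC(),
    "KNN":  KNeighborsClassifier(),
    "Tree": DecisionTreeClassifier(random_state=0),
}
for name, clf in models.items():
    for fs in (False, True):
        steps = [StandardScaler()]
        if fs:  # filter technique: keep the k features most informative about y
            steps.append(SelectKBest(mutual_info_classif, k=10))
        pipe = make_pipeline(*steps, clf)
        pipe.fit(X_tr, y_tr)
        acc = accuracy_score(y_te, pipe.predict(X_te))
        print(f"{name:4s} FS={fs}: accuracy={acc:.3f}")
```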
Most real-world datasets are contaminated by quality issues that have a severe effect on analysis results. Duplication is one of the main quality issues that hinder these results. Different studies have tackled the duplication issue from different perspectives; however, the sensitivity of supervised and unsupervised learning models in the presence of different types of duplicates, deterministic and probabilistic, has not been broadly addressed. Furthermore, a simple metric is typically used to estimate the ratio of both types of duplicates, regardless of the probability with which a record is considered a duplicate. In this paper, the sensitivity of five classifiers and four clustering algorithms toward deterministic and probabilistic duplicates at different ratios (0%-15%) is tracked. Five evaluation metrics are used to accurately track the changes in the sensitivity of each learning model: MCC, F1-score, accuracy, average silhouette coefficient, and Dunn index. In addition, a metric to measure the ratio of probabilistic duplicates within a dataset is introduced. The results confirm the effectiveness of the proposed metric in reflecting the ratio of probabilistic duplicates within the dataset. All learning models, both classification and clustering, are sensitive to the existence of duplicates, but to different degrees. RF and K-means are positively affected: their performance increases as the percentage of duplicates increases. The remaining classifiers and clustering algorithms are negatively affected by duplicates, especially at high percentages.
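The paper's proposed metric and experimental code are not reproduced here; the sketch below only illustrates the kind of sensitivity experiment the abstract describes, injecting deterministic (exact) duplicates at the stated ratios and tracking the MCC of a random forest. The synthetic dataset, class weights, and classifier settings are assumptions.

```python
# Illustrative sensitivity experiment: inject exact (deterministic) duplicates
# at growing ratios into the training set and track classifier MCC.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import matthews_corrcoef

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.8],
                           random_state=0)  # mildly unbalanced, assumed
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

for ratio in (0.00, 0.05, 0.10, 0.15):  # duplicate ratios from the abstract
    n_dup = int(ratio * len(X_tr))
    idx = rng.choice(len(X_tr), size=n_dup, replace=True)
    X_aug = np.vstack([X_tr, X_tr[idx]])       # exact copies of sampled rows
    y_aug = np.concatenate([y_tr, y_tr[idx]])
    clf = RandomForestClassifier(random_state=0).fit(X_aug, y_aug)
    mcc = matthews_corrcoef(y_te, clf.predict(X_te))
    print(f"duplicates={ratio:.0%}  MCC={mcc:.3f}")
```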
Data completeness is one of the most common challenges that hinder the performance of data analytics platforms. Different studies have assessed the effect of missing values on different classification models based on a single evaluation metric, namely accuracy. However, accuracy on its own is a misleading measure of classifier performance because it does not account for class imbalance. This paper presents an experimental study that assesses the effect of incomplete datasets on the performance of five classification models. The analysis was conducted with different ratios of missing values in six datasets that vary in size, type, and balance. Moreover, for unbiased analysis, the performance of the classifiers was measured using three different metrics, namely the Matthews correlation coefficient (MCC), the F1-score, and accuracy. The results show that the sensitivity of the supervised classifiers to missing data depends on a set of factors. The most significant factor is the missing data pattern and ratio, followed by the imputation method, and then the type, size, and balance of the dataset. The classifiers are less sensitive when data are Missing Completely At Random (MCAR) than when data are Missing Not At Random (MNAR). Furthermore, using the MCC as an evaluation measure better reflects the variation in the sensitivity of the classifiers to missing data.
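As a sketch of this kind of evaluation (not the paper's code), the snippet below injects MCAR missingness at one assumed ratio, imputes with feature means fitted on the training split, and reports all three metrics. The dataset, classifier, ratio, and imputation strategy are assumptions; the paper sweeps several ratios, patterns, and imputers.

```python
# Illustrative MCAR experiment: mask a fraction of cells completely at random,
# impute with feature means, and score a classifier on MCC, F1, and accuracy.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import matthews_corrcoef, f1_score, accuracy_score

rng = np.random.default_rng(0)
X, y = load_breast_cancer(return_X_y=True)

missing_ratio = 0.10                          # assumed; the paper uses several
mask = rng.random(X.shape) < missing_ratio    # MCAR: every cell equally likely
X_miss = X.copy()
X_miss[mask] = np.nan

X_tr, X_te, y_tr, y_te = train_test_split(X_miss, y, random_state=0)
imp = SimpleImputer(strategy="mean").fit(X_tr)  # fit on training split only
X_tr, X_te = imp.transform(X_tr), imp.transform(X_te)

clf = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)
pred = clf.predict(X_te)
print(f"MCC={matthews_corrcoef(y_te, pred):.3f}  "
      f"F1={f1_score(y_te, pred):.3f}  "
      f"accuracy={accuracy_score(y_te, pred):.3f}")
```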