2015
DOI: 10.1007/s10916-015-0312-5

A Data Preparation Methodology in Data Mining Applied to Mortality Population Databases

Abstract: It is known that the data preparation phase is the most time consuming in the data mining process, using from 50 % up to 70 % of the total project time. Current data mining methodologies are general purpose, and one of their limitations is that they do not provide guidance about which particular tasks to carry out in a specific domain. This paper presents a new data preparation methodology oriented to the epidemiological domain, in which we have identified two sets of tasks: General Data Preparation and Specif…
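To make the General/Specific split in the abstract concrete, here is a minimal sketch of what the two task groups might look like on a mortality records table. The column names (age, sex, icd10_cause) and the cause grouping are assumptions made for illustration, not details taken from the paper.

```python
import pandas as pd

# Hypothetical mortality records; column names are assumed for illustration.
records = pd.DataFrame({
    "age": [34, -1, 78, None, 55],
    "sex": ["M", "F", "F", "M", "X"],
    "icd10_cause": ["I21", "C34", "i21 ", None, "J18"],
})

# --- General data preparation: domain-independent cleaning ---
clean = records.dropna(subset=["age", "icd10_cause"])           # remove incomplete rows
clean = clean[(clean["age"] >= 0) & (clean["age"] <= 120)].copy()  # drop implausible ages
clean["icd10_cause"] = clean["icd10_cause"].str.strip().str.upper()

# --- Specific data preparation: epidemiology-oriented transformation ---
# Map detailed ICD-10 codes to broad cause-of-death groups (an assumed grouping).
cause_groups = {"I": "circulatory", "C": "neoplasms", "J": "respiratory"}
clean["cause_group"] = clean["icd10_cause"].str[0].map(cause_groups)

print(clean)
```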

Cited by 18 publications (8 citation statements)
References 10 publications (10 reference statements)
“…The original dataset will thus cause poor performance of the subsequent prediction model and will become a bottleneck in the process of data mining. Therefore, the process of data preprocessing, including data reduction, data cleaning, data transformation and data integration, is crucial [57] and typically comprises 70~80% of the workload of data mining [58].…”
Section: Discussion (mentioning)
confidence: 99%
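The statement above names the four preprocessing steps that dominate the data mining workload: reduction, cleaning, transformation and integration. Below is a minimal sketch chaining those steps with pandas and scikit-learn; the table names, columns and the choice of PCA for reduction are assumptions for illustration, not taken from the cited works.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Two hypothetical source tables to integrate; names/columns are assumed for illustration.
deaths = pd.DataFrame({"region": ["A", "A", "B"], "age": [70, 82, None], "cause": ["I21", "C34", "I21"]})
population = pd.DataFrame({"region": ["A", "B"], "pop": [120_000, 95_000]})

# Data integration: combine the sources on a shared key.
df = deaths.merge(population, on="region", how="left")

# Data cleaning: handle missing values (here, median imputation for age).
df["age"] = df["age"].fillna(df["age"].median())

# Data transformation: encode categories and scale numeric features.
X = pd.get_dummies(df[["age", "pop", "cause"]], columns=["cause"])
X_scaled = StandardScaler().fit_transform(X)

# Data reduction: project onto fewer dimensions (PCA is one common choice).
X_reduced = PCA(n_components=2).fit_transform(X_scaled)
print(X_reduced.shape)  # (3, 2)
```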
“…This requires a good understanding of the data mining goals as well as the data itself (Pyle, Editor, & Cerra, 1999). Data selection, also called "Dimensionality Reduction" (Liu & Motoda, 1998), consists in vertical (attributes/variables) selection and horizontal (instance/records) selection (García, Luengo, & Herrera, 2015;Nisbet, Elder, & Miner, 2009;Pérez et al, 2015) (Table 6). Also, it is worth noticing that models obtained from a reduced number of features will be easier to understand (Pyle et al, 1999).…”
Section: Data Selection (mentioning)
confidence: 99%
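The statement above describes data selection as vertical (attribute/variable) selection plus horizontal (instance/record) selection. A minimal sketch of that two-way selection on a hypothetical mortality table follows; the column names and the filtering rule are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical mortality dataset; column names are assumed for illustration.
df = pd.DataFrame({
    "age": [70, 82, 15, 66],
    "sex": ["M", "F", "M", "F"],
    "cause": ["I21", "C34", "V89", "I21"],
    "registrar_id": [11, 12, 13, 14],   # administrative field, irrelevant to modelling
})

# Vertical selection: keep only the attributes (columns) relevant to the analysis.
selected_attributes = df[["age", "sex", "cause"]]

# Horizontal selection: keep only the instances (rows) of interest,
# e.g. adult deaths from circulatory causes (an assumed filter).
selected_instances = selected_attributes[
    (selected_attributes["age"] >= 18) & (selected_attributes["cause"].str.startswith("I"))
]

print(selected_instances)
```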
“…Another focus of the researchers in this domain was to compare the performance of simulation platforms. For instance, in [18] In [19], a set of individual classifiers involved in an ensemble classifier, solo classifiers and neural network classifiers was applied on 4 datasets provided by UCI: the Wisconsin Diagnostic Breast Cancer (WDBC) dataset, the ILPD, the VCDS and the HDDS. Different from the similar studies, the focus of [20] was Fatty Liver Disease (FLD) and several methods such as Decision Tree, SVM, AdaBoost, KNN, Probabilistic Neural Network (PNN), Naive Bayes and Fuzzy Sugeno were used to work with normal and abnormal liver images through linear and quadratic discriminant analysis.…”
Section: Related Work (mentioning)
confidence: 99%