Machine Learning Methods Based Preprocessing to Improve Categorical Data Classification

Ruiz-Chavez, Zoila; Salvador-Meneses, Jaime; García-Rodríguez, José

doi:10.1007/978-3-030-03493-1_32

Cited by 3 publications

(5 citation statements)

References 6 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Table 15 and Table 16 , and Figure 10 show the compression of a real world census data taken from Base de Datos-Censo de Población y Vivienda 2010 ( ). The dataset is a subset of the original census used for imputation of missing data [ 30 ].…”

Section: Resultsmentioning

confidence: 99%

Compressed kNN: K-Nearest Neighbors with Data Compression

Salvador-Meneses

Ruiz-Chavez

García-Rodríguez

2019

Entropy

View full text Add to dashboard Cite

The kNN (k-nearest neighbors) classification algorithm is one of the most widely used non-parametric classification methods, however it is limited due to memory consumption related to the size of the dataset, which makes them impractical to apply to large volumes of data. Variations of this method have been proposed, such as condensed KNN which divides the training dataset into clusters to be classified, other variations reduce the input dataset in order to apply the algorithm. This paper presents a variation of the kNN algorithm, of the type structure less NN, to work with categorical data. Categorical data, due to their nature, can be compressed in order to decrease the memory requirements at the time of executing the classification. The method proposes a previous phase of compression of the data to then apply the algorithm on the compressed data. This allows us to maintain the whole dataset in memory which leads to a considerable reduction of the amount of memory required. Experiments and tests carried out on known datasets show the reduction in the volume of information stored in memory and maintain the accuracy of the classification. They also show a slight decrease in processing time because the information is decompressed in real time (on-the-fly) while the algorithm is running.

show abstract

Section: Resultsmentioning

confidence: 99%

Compressed kNN: K-Nearest Neighbors with Data Compression

Salvador-Meneses

Ruiz-Chavez

García-Rodríguez

2019

Entropy

View full text Add to dashboard Cite

show abstract

“…Yet, researchers can prefer clustering data using SOM for the following features: the network learns without a teacher, output neurons are ordered, the learning process is performed on the basis of competition and emergence of the winning neuron, a topological map is created and the neighborhood of clusters identified as a learning result. Traditionally, the method can handle numerical data, but owing to the dummy coding, SOM can also operate on categorical data [5,26]. Despite some weaknesses, the method is very attractive for data exploratory analysis, widely used to solve a large range of problems, and can be considered as an appropriate clustering algorithm for high dimensional dataset [3,5,10,26].…”

Section: Methodsmentioning

confidence: 99%

“…Traditionally, the method can handle numerical data, but owing to the dummy coding, SOM can also operate on categorical data [5,26]. Despite some weaknesses, the method is very attractive for data exploratory analysis, widely used to solve a large range of problems, and can be considered as an appropriate clustering algorithm for high dimensional dataset [3,5,10,26].…”

Section: Methodsmentioning

confidence: 99%

Use of data mining in a two‐step process of profiling student preferences in relation to the enhancement of English as a foreign language teaching

Nowakowska

Beben

Pajęcki

2020

Statistical Analysis

View full text Add to dashboard Cite

The paper pursues a twofold goal. The first goal refers to the identification of university students' needs regarding such modifications to English language courses that would improve English as a foreign language (EFL) teaching outcomes. The other goal refers to the methodical issue of achieving the first one. In this aspect, the use of selected data mining techniques in a hierarchical way in real data processing is proposed. These are: (a) Self‐Organizing Map (SOM) dataset segmentation and then (b) market basket analysis applied to the individual SOM segments. The research data were collected from the students' survey concerning their opinion of the EFL teaching process; 347 students of a faculty of a technical university in Poland completed the questionnaire. The use of SOM allowed the identification of homogeneous groups of students, while market basket analysis allowed indicating, within each group, the relationships between student opinions of effective methods of teaching English. In such a way, satisfactory student preference profiles as regards their approach to the improvement of English language competences were developed. On this basis, EFL teaching methods appropriate for the specific profile can be adapted.

show abstract

“…Each neuron effectively calculates a "weighted sum" of its input, adds a bias and decides whether or not to fire. This can be expressed as a function f applied to a linear classifier 𝐰 T 𝒙 + 𝑏 (8). The decision function f could be a non-linear threshold function, a non-linear distance function or a probability-like sigmoid function.…”

Section: ) Artificial Neural Network (Ann)mentioning

confidence: 99%

“…In contrast, categorical variables tend to hide (even mask) a great deal of the interesting information in a dataset [4,5,6]. It is not so easy to see trends and make predictions or forecasts when categorical variables dominate the dataset [7,8]. This makes it crucial to develop systematic methods and heuristics for dealing with such variables.…”

Section: Introductionmentioning

confidence: 99%

The Categorical Data Conundrum: Heuristics for Classification Problems—A Case Study on Domestic Fire Injuries

et al. 2022

View full text Add to dashboard Cite

Machine learning is well developed amongst the scientific community in terms of theoretical foundations (statistics and algorithms) and frameworks (Tensorflow, PyTorch, H2O). However, machine learning is heavily focused on numerical data, or numerical data mixed with some categorical data. For numerical datasets, scientists and engineers can enjoy reasonable success with only a limited knowledge of theoretical foundations and the inner workings of machine learning frameworks. However, it is a different story when dealing with purely categorical datasets, which require a deeper understanding of machine learning frameworks and associated encodings and algorithms in order to achieve success. This paper addresses the issues in handling purely categorical datasets for multi-classification problems and provides a set of heuristics for dealing with purely categorical data. In particular, issues such as pre-processing, feature encoding and algorithm selection are considered. The heuristics are then demonstrated through a case study, based on a categorical data set of domestic fire injuries, covering a 10-year period. Novel contributions are made through the heuristics and the performance analysis of different encoding techniques. The case study itself also makes a novel contribution through the classification of different types of injuries, based on related features.

show abstract

Machine Learning Methods Based Preprocessing to Improve Categorical Data Classification

Cited by 3 publications

References 6 publications

Compressed kNN: K-Nearest Neighbors with Data Compression

Compressed kNN: K-Nearest Neighbors with Data Compression

Use of data mining in a two‐step process of profiling student preferences in relation to the enhancement of English as a foreign language teaching

The Categorical Data Conundrum: Heuristics for Classification Problems—A Case Study on Domestic Fire Injuries

Contact Info

Product

Resources

About