Introduction: K-nearest neighbor (k-NN) classification is a conventional non-parametric classifier that has been used as the baseline classifier in many pattern classification problems. It is based on measuring the distances between the test data and each of the training data to decide the final classification output.
Case description: Although the Euclidean distance function is the most widely used distance metric in k-NN, no study has examined the classification performance of k-NN under different distance functions, especially for various medical domain problems. Therefore, the aim of this paper is to investigate whether the distance function can affect k-NN performance over different medical datasets. Our experiments are based on three different types of medical datasets containing categorical, numerical, and mixed types of data, and four different distance functions, including Euclidean, cosine, Chi square, and Minkowsky, are used during k-NN classification individually.
Discussion and evaluation: The experimental results show that the Chi square distance function is the best choice for all three types of datasets, whereas the cosine, Euclidean, and Minkowsky distance functions perform the worst over the mixed type of datasets.
Conclusions: In this paper, we demonstrate that the chosen distance function can affect the classification accuracy of the k-NN classifier. For medical domain datasets including categorical, numerical, and mixed types of data, k-NN based on the Chi square distance function performs the best.
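The role the distance function plays in k-NN can be sketched in a few lines. This is a minimal pure-Python illustration based on the standard definitions of the four distances, not the paper's implementation; the function names and the toy data in the test are our own:

```python
import math
from collections import Counter

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def minkowsky(x, y, p=3):
    # Minkowsky distance of order p (p=3 here is an arbitrary example).
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def cosine(x, y):
    # Cosine *distance*: 1 minus the cosine similarity.
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return 1 - dot / (nx * ny)

def chi_square(x, y):
    # Chi-square distance, defined for non-negative feature values.
    return sum((a - b) ** 2 / (a + b) for a, b in zip(x, y) if a + b > 0)

def knn_predict(train, labels, query, k=3, dist=euclidean):
    # Rank training samples by distance to the query, then
    # take a majority vote among the k nearest neighbors.
    ranked = sorted(range(len(train)), key=lambda i: dist(train[i], query))
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]
```

Swapping the `dist` argument (e.g. `dist=chi_square`) is all it takes to rerun the same classifier under a different metric, which is exactly the experimental variable the paper studies.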
Breast cancer is an all too common disease in women, making how to effectively predict it an active research problem. A number of statistical and machine learning techniques have been employed to develop various breast cancer prediction models. Among them, support vector machines (SVM) have been shown to outperform many related techniques. To construct an SVM classifier, it is first necessary to choose the kernel function, and different kernel functions can result in different prediction performance. However, very few studies have examined the prediction performance of SVM with different kernel functions. Moreover, it is unknown whether SVM classifier ensembles, which have been proposed to improve the performance of single classifiers, can outperform single SVM classifiers in terms of breast cancer prediction. Therefore, the aim of this paper is to fully assess the prediction performance of SVM and SVM ensembles over small and large scale breast cancer datasets. The classification accuracy, ROC, F-measure, and computational time of training SVM and SVM ensembles are compared. The experimental results show that linear-kernel SVM ensembles with the bagging method and RBF-kernel SVM ensembles with the boosting method are the better choices for a small scale dataset, where feature selection should be performed in the data pre-processing stage. For a large scale dataset, RBF-kernel SVM ensembles with boosting perform better than the other classifiers.
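The two kernels compared, and the majority-vote combination that a bagging ensemble applies at prediction time, can be written down directly. These are the standard textbook definitions in pure Python, not the study's code; the `gamma` default is an arbitrary assumption:

```python
import math
from collections import Counter

def linear_kernel(x, y):
    # K(x, y) = <x, y>
    return sum(a * b for a, b in zip(x, y))

def rbf_kernel(x, y, gamma=0.5):
    # K(x, y) = exp(-gamma * ||x - y||^2)
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq)

def bagging_vote(classifiers, x):
    # A bagging ensemble trains each base classifier on a bootstrap
    # resample of the data, then combines them by majority vote.
    # Here each classifier is any callable returning a class label.
    votes = Counter(clf(x) for clf in classifiers)
    return votes.most_common(1)[0][0]
```

Boosting differs in that the base classifiers are trained sequentially on reweighted data and their votes are weighted, but the ensemble-of-SVMs structure is the same.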
Introduction: More and more universities are receiving accreditation from the Association to Advance Collegiate Schools of Business (AACSB), an international association for promoting quality teaching and learning at business schools. To be accredited, schools are required to meet a number of standards ensuring that certain levels of teaching quality and student learning are met. However, a variety of points of view have been espoused in the literature regarding the relationship between research and teaching: some studies have demonstrated that research and teaching are complementary elements of learning, while others disagree with these findings.
Case description: Unlike past studies, we focus on analyzing the research performance of accredited schools during the periods before and after receiving accreditation. The objective is to answer the question of whether performance has improved by comparing the same school's performance before and after accreditation. In this study, four AACSB-accredited universities in Taiwan are analyzed, including one teaching-oriented and three research-oriented universities. Research performance is evaluated by comparing seven citation statistics: the number of papers published, number of citations, average number of citations per paper, average citations per year, h-index (annual), h-index, and g-index.
Discussion and evaluation: The analysis results show that business schools demonstrated enhanced research performance after AACSB accreditation, but in most accredited schools the proportion of faculty members not actively doing research is larger than that of active ones.
Conclusion: This study shows that AACSB accreditation has a positive impact on research performance. The findings can be used as a reference for current non-accredited schools that aim to improve their research productivity and quality.
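Two of the citation statistics used, the h-index and the g-index, follow directly from their standard definitions and can be computed as below (a straightforward sketch, not the study's evaluation code):

```python
def h_index(citations):
    # Largest h such that h papers each have at least h citations.
    cites = sorted(citations, reverse=True)
    h = 0
    for i, c in enumerate(cites, start=1):
        if c >= i:
            h = i
    return h

def g_index(citations):
    # Largest g such that the g most-cited papers together
    # have at least g^2 citations.
    cites = sorted(citations, reverse=True)
    total, g = 0, 0
    for i, c in enumerate(cites, start=1):
        total += c
        if total >= i * i:
            g = i
    return g
```

For example, a faculty member with citation counts [10, 8, 5, 4, 3] has an h-index of 4 (four papers with at least four citations each) and a g-index of 5 (the top five papers total 30 ≥ 25 citations).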
Purpose: Data mining is widely considered necessary in many business applications for effective decision making. The importance of business data mining is reflected by the existence of numerous surveys in the literature investigating related works that use data mining techniques to solve specific business problems. However, there has been no recent study answering the following question: what are the most widely used data mining techniques in business applications?
Design/methodology/approach: The aim of this paper is to examine related surveys in the literature and thus to identify the frequently applied data mining techniques. To ensure the recency and quality of the conclusions, the criterion for selecting related studies is that the works be published in reputed journals within the past 10 years.
Findings: There are 33 different data mining techniques employed in eight different application areas. Most of them are supervised learning techniques, and the application area where such techniques are most often applied is bankruptcy prediction, followed by the areas of customer relationship management, fraud detection, intrusion detection, and recommender systems. Furthermore, the 10 most widely used data mining techniques for business applications are the decision tree (including C4.5 and CART), genetic algorithm, k-nearest neighbor, multilayer perceptron neural network, naïve Bayes, and support vector machine as the supervised learning techniques, and association rule, expectation maximization, and k-means as the unsupervised learning techniques.
Originality/value: The originality of this paper is to survey the past ten years of related survey and review articles about data mining in business applications in order to identify the most popular techniques.
BACKGROUND: Medical datasets are usually very large, which directly affects the computational cost of the data mining process. Instance selection is a data preprocessing step in the knowledge discovery process that can be employed to reduce storage requirements while maintaining mining quality. It aims to filter out outliers (or noisy data) from a given (training) dataset. However, when the dataset is very large, more time is required to accomplish the instance selection task.
OBJECTIVE: In this paper, we introduce an efficient data preprocessing approach (EDP) composed of two steps. The first step trains a model over a small amount of training data after performing instance selection. The model is then used to identify good and noisy instances in the rest of the large amount of training data.
METHODS: Experiments are conducted on two medical datasets, for breast cancer and protein homology prediction, each containing over 100,000 data samples. Three well-known instance selection algorithms are used: IB3, DROP3, and genetic algorithms. In addition, three popular classification techniques are used to construct the learning models for comparison, namely the CART decision tree, k-nearest neighbor (k-NN), and support vector machine (SVM).
RESULTS: The results show that our proposed approach not only reduces the computational cost by nearly a factor of two or three relative to three state-of-the-art algorithms, but also maintains the final classification accuracy.
CONCLUSIONS: Directly executing existing instance selection algorithms over large scale medical datasets incurs a large computational cost. Our proposed EDP approach solves this problem by training a learning model to recognize good and noisy data.
Considering both computational complexity and final classification accuracy, the proposed EDP has demonstrated its efficiency and effectiveness for the large scale instance selection problem.
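The two-step idea behind EDP can be sketched as follows. This is a simplified illustration only: a basic edited-nearest-neighbour rule stands in for IB3/DROP3/genetic-algorithm instance selection, and a 1-NN model stands in for the learned filter; function names and data are our own:

```python
import math

def nearest_label(data, labels, x, skip=None):
    # Label of the closest training point (optionally skipping index `skip`,
    # so a point is not compared against itself).
    best, best_d = None, float('inf')
    for i, p in enumerate(data):
        if i == skip:
            continue
        d = math.dist(p, x)
        if d < best_d:
            best, best_d = labels[i], d
    return best

def edit_instances(data, labels):
    # Step 1: run instance selection on a small sample -- here, keep only
    # points whose nearest neighbour agrees with their own label.
    keep = [i for i in range(len(data))
            if nearest_label(data, labels, data[i], skip=i) == labels[i]]
    return [data[i] for i in keep], [labels[i] for i in keep]

def edp_filter(sample, sample_labels, rest, rest_labels):
    # Step 2: train a model (here 1-NN) on the edited sample and use it to
    # flag each remaining instance as good (prediction agrees with its
    # label) or noisy. Returns the indices of the good instances.
    kept_x, kept_y = edit_instances(sample, sample_labels)
    return [i for i, (x, y) in enumerate(zip(rest, rest_labels))
            if nearest_label(kept_x, kept_y, x) == y]
```

The payoff is that the expensive selection algorithm only ever sees the small sample; the bulk of the data is screened by a single cheap model pass.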
Purpose - Churn prediction is a very important task for successful customer relationship management. In general, churn prediction can be achieved by many data mining techniques. However, during data mining, dimensionality reduction (or feature selection) and data reduction are two important data preprocessing steps. In particular, feature selection and data reduction aim to filter out irrelevant features and noisy data samples, respectively. The purpose of this paper is to examine how the order in which these data preprocessing tasks are performed affects the quality of the mining results.
Design/methodology/approach - Based on a real telecom customer churn data set, seven different preprocessed data sets, produced by performing feature selection and data reduction with different priorities, are used to train an artificial neural network as the churn prediction model.
Findings - The results show that performing data reduction first, by self-organizing maps, and feature selection second, by principal component analysis, allows the prediction model to provide the highest prediction accuracy. In addition, this priority allows more efficient learning, since 66 and 62 percent of the original features and data samples, respectively, are reduced.
Originality/value - The contribution of this paper is to identify the better order in which to perform the two important data preprocessing steps for telecom churn prediction.
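The best-performing preprocessing order, rows first and columns second, can be illustrated schematically. This is a toy sketch, not the paper's pipeline: random prototype sampling stands in for SOM-based data reduction, a plain SVD projection implements the PCA step, and the sampling fraction mirrors the reduction rates reported above:

```python
import numpy as np

rng = np.random.default_rng(0)

def reduce_data(X, y, keep_frac=0.38):
    # Data reduction: keep a subset of the samples. The paper uses
    # self-organizing maps to pick representatives; plain random
    # sampling stands in for that here (keeping 38% mirrors the
    # reported 62% sample reduction).
    n_keep = max(1, int(len(X) * keep_frac))
    idx = rng.choice(len(X), size=n_keep, replace=False)
    return X[idx], y[idx]

def reduce_features(X, n_components=2):
    # Feature reduction via PCA: centre the data and project it
    # onto the top principal components obtained from the SVD.
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

# The priority found best in the study: data reduction first,
# feature selection second (toy data; 2 of 6 features kept
# mirrors the reported 66% feature reduction).
X = rng.normal(size=(100, 6))
y = rng.integers(0, 2, size=100)
X_small, y_small = reduce_data(X, y)
X_ready = reduce_features(X_small)
```

Reversing the two calls gives the alternative priority; the study's finding is that the row-reduction-first ordering above yields both the best accuracy and the cheapest neural network training.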