Image data are normally unstructured and high-dimensional, because advances in photography allow an image to be taken at a wide range of resolution levels. To overcome this problem, data miners may select only a minimal set of features that are truly important for classifying their images. Feature selection is a popular method for reducing the dimensionality of data. However, most feature selection algorithms return their results as a score for each feature. It remains difficult for data miners to choose features from such scores because they may not know which score range is best for the classification task at hand. Therefore, in this research, we aim to assist data miners and novice data analysts in solving the dimensionality problem by finding the optimal set of features for them, instead of just reporting the scores of all features and leaving the selection step as a burden on the miners. We select the optimal set of features by first applying a clustering technique to group similar features based on their scores. We propose the silhouette width criterion for selecting the optimal number of clusters during the cluster analysis step. We then perform association mining to analyze relationships that may exist between different subsets of features and the target attribute. Finally, our method reports to the user the best subset of features for potential use in data classification. We demonstrate the performance of the proposed method on satellite image data of forests in Japan.
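The silhouette-width step described in this abstract can be sketched roughly as follows. This is only an illustration, not the authors' implementation: the abstract does not name the clustering algorithm, so a plain 1-D k-means is assumed here, and the function names (`kmeans_1d`, `mean_silhouette`, `best_k`) and the candidate range of cluster counts are hypothetical.

```python
def kmeans_1d(scores, k, iters=50):
    """Plain Lloyd's k-means on 1-D feature scores (assumed clustering method)."""
    ordered = sorted(scores)
    n = len(ordered)
    # Initial centers: evenly spaced order statistics of the scores.
    centers = [ordered[int((j + 0.5) * n / k)] for j in range(k)]
    labels = [0] * n
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: abs(s - centers[j])) for s in scores]
        new_centers = []
        for j in range(k):
            members = [s for s, lab in zip(scores, labels) if lab == j]
            # An empty cluster keeps its previous center.
            new_centers.append(sum(members) / len(members) if members else centers[j])
        if new_centers == centers:
            break
        centers = new_centers
    return labels


def mean_silhouette(scores, labels):
    """Average silhouette width; requires at least two non-empty clusters."""
    clusters = sorted(set(labels))
    if len(clusters) < 2:
        raise ValueError("silhouette width needs at least two clusters")
    widths = []
    for i, (s, lab) in enumerate(zip(scores, labels)):
        own = [abs(s - t) for j, (t, l) in enumerate(zip(scores, labels))
               if l == lab and j != i]
        if not own:
            widths.append(0.0)  # usual convention for singleton clusters
            continue
        a = sum(own) / len(own)  # mean distance within own cluster
        b = min(  # mean distance to the nearest other cluster
            sum(abs(s - t) for t, l in zip(scores, labels) if l == c)
            / sum(1 for l in labels if l == c)
            for c in clusters if c != lab
        )
        denom = max(a, b)
        widths.append(0.0 if denom == 0 else (b - a) / denom)
    return sum(widths) / len(widths)


def best_k(scores, k_range=range(2, 5)):
    """Return the cluster count with the largest mean silhouette width."""
    widths = {k: mean_silhouette(scores, kmeans_1d(scores, k)) for k in k_range}
    return max(widths, key=widths.get), widths


# Illustrative feature scores only (not taken from the paper).
example_scores = [0.90, 0.88, 0.85, 0.52, 0.50, 0.48, 0.10, 0.08, 0.05]
chosen_k, all_widths = best_k(example_scores)
```

With scores that fall into visibly separate groups, the mean silhouette width peaks at the number of groups, which is what lets the criterion replace a hand-picked score threshold.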
Water is an important part of our daily lives: food, manufacturing, agriculture, etc. When there is not enough water for the whole population, the shortage leads to many undesirable impacts, including drought, famine, and death. The solution to this problem is good management of water resources, which involves planning and designing water-related projects. Runoff prediction is a major part of this planning. It is a complex process that needs an adequate modeling technique for accurate prediction. Therefore, we propose combining algorithms to improve prediction performance. Our combination includes two powerful methods: Artificial Neural Network (ANN) and Support Vector Regression (SVR). We use two criteria, the root mean square error (RMSE) and the correlation coefficient (R), to evaluate model performance by comparing actual runoff with the predictions made by our model. We also compare the performance of our model against other algorithms: Linear Regression, ANN, and Support Vector Machines. The comparison shows that our proposed method performs best and that the combined model is also quite accurate in predicting peak runoff values during the heavy rain season.
Index Terms: Runoff prediction, artificial neural network, support vector regression, Mun Basin.
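The two evaluation criteria named in this abstract, RMSE and the correlation coefficient R, can be computed as in the minimal sketch below. It assumes R means the Pearson correlation between actual and predicted runoff; the function names and the sample values are illustrative and do not come from the paper.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


def correlation(actual, predicted):
    """Pearson correlation coefficient (assumed meaning of R)."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_a * var_p)


# Made-up runoff values, for illustration only.
actual_runoff = [1.0, 2.0, 3.0, 4.0]
predicted_runoff = [2.0, 3.0, 4.0, 5.0]
error = rmse(actual_runoff, predicted_runoff)
r = correlation(actual_runoff, predicted_runoff)
```

The example is chosen so that the two criteria disagree: a prediction that is off by a constant has perfect correlation but a nonzero RMSE, which is one reason to report both.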
General data mining algorithms classify balanced datasets well, but when the data are imbalanced the prediction rate for the minority class remains low. Random sampling techniques have been applied to address imbalanced data, but random sampling sometimes selects samples whose features clearly differ between the two classes. When unseen data from the minority class then have features that resemble the majority class, the classification model misclassifies them because the sample data it learned from were incomplete. A genetic algorithm can be applied to find the optimal parameters and so improve classification performance, but sometimes it cannot find the best set of parameters because the random initial population does not cover that set. This research proposes a technique intended to guarantee that the genetic algorithm can find the optimal parameters by using a restarting technique, which recreates the initial population when the new generation performs worse than the old population. The results show that the proposed technique improves classification of the minority class in imbalanced datasets more than the other techniques.
The objective of this research is to study top-k ranking for ambiguous queries. In this paper we demonstrate our query answering strategy for ranking world population data in a deductive database. Most problems stem from incorrect ranking, because some questions are ambiguous, such as "Find the country which have the population between 1,500,000 and 3,000,000 people by the most densely population is approximately 2,400,000 people." This research proposes a top-k ranking technique that uses a membership function to evaluate and rank the possible answers. We show comparative results for each kind of membership function.
Index Terms: ER data mining, datalogtop-k, datalog, deductive database.
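As a rough illustration of ranking with a membership function, the sketch below scores each candidate with a triangular membership function and keeps the k best. Several things here are assumptions: the paper works with a deductive database (Datalog), whereas this sketch uses Python; the triangular shape is only one possible membership function; the example query is read as "between 1,500,000 and 3,000,000, ideally about 2,400,000"; and the records A to E are placeholders, not real countries.

```python
def triangular(x, low, peak, high):
    """Triangular membership: 0 outside (low, high), 1 at peak."""
    if x <= low or x >= high:
        return 0.0
    if x <= peak:
        return (x - low) / (peak - low)
    return (high - x) / (high - peak)


def top_k(records, k, low, peak, high):
    """Rank (name, population) pairs by membership degree and keep the k best."""
    scored = [(name, triangular(pop, low, peak, high)) for name, pop in records]
    matching = [item for item in scored if item[1] > 0.0]
    return sorted(matching, key=lambda item: item[1], reverse=True)[:k]


# Placeholder records (hypothetical names and populations).
records = [
    ("A", 2_400_000),
    ("B", 2_000_000),
    ("C", 2_900_000),
    ("D", 5_000_000),
    ("E", 1_600_000),
]
ranking = top_k(records, 3, low=1_500_000, peak=2_400_000, high=3_000_000)
```

Swapping `triangular` for another shape, such as a trapezoidal or Gaussian function, changes the ordering of borderline answers, which is the kind of comparison the abstract reports.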
Data classification algorithms normally perform best when the dataset is well balanced, that is, when the number of data instances in each class is approximately equal. When the dataset is imbalanced, however, the classification model tends to be biased toward the majority class. The goal of imbalanced data classification is to improve a model's ability to recognize data from the minority class, especially when the minority class is more interesting than the majority class. In this research, we propose a technique that balances the data with hybrid resampling techniques and then performs parameter optimization with a restarting genetic algorithm. The optimized parameters are used by a support vector machine to induce an efficient model for recognizing data in the minority class while maintaining overall accuracy. The experimental results show that the proposed technique achieves higher performance than the other techniques.
Data classification mining is a method that generalizes data in the form of rules and then uses these rules to predict unknown values in future data. In actual applications, however, the rules may have low accuracy, and they may be so numerous that users cannot apply them efficiently. Therefore, this research proposes a data classification algorithm with compact fuzzy association rules to optimize both the accuracy and the interpretability of the model. To evaluate the performance of the proposed method, this research compares the accuracy of the classification model and the number of rules against 9 different data classification algorithms. The results show that our CCFAR algorithm is comparable in terms of accuracy. When both accuracy and model size are considered, our algorithm is the best one.
The aim of this paper is to perform a comparative study of the feature reduction techniques that are most appropriate for classification with k-nearest neighbor, tested on medical data. Medical data are normally high-dimensional by nature, and this high dimensionality can affect the performance of the classification process. In this work, we apply various feature reduction techniques, implemented in Matlab, to decrease the dimensions of the data before the k-nearest neighbor classification step. From the experimental results, we found that the best performance is obtained by using the PCA algorithm to reduce the features of the data. The comparison in terms of accuracy shows that the PCA and ROC feature reduction techniques can improve classification prediction, whereas t-test feature reduction has a very limited effect on classification accuracy.
Data mining is the process of finding knowledge in a huge amount of stored information and using the discovered knowledge to predict or classify new data items whose class labels are unknown. Among the many available algorithms for data classification, the support vector machine is one of the most accurate mining methods. The support vector machine is a parametric approach, so the proper setting of parameter values directly influences its classification performance. Currently, a genetic algorithm can be used to find the best parameters for a support vector machine. A genetic algorithm is an adaptive heuristic search for an optimal solution, based on the evolutionary characteristics of nature. The problem with the genetic algorithm is that it sometimes cannot find the best parameters because of an improper random initial value. In this research, we propose a new technique that uses a restarting concept to improve the ability of the genetic algorithm to find the best parameters. We show the performance of the proposed technique in an application to image-based forest type classification over a forest area in Japan, using satellite image data from the ASTER satellite. The results show that the proposed technique classifies forest types more accurately than other existing techniques.
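The restarting concept might look something like the sketch below. It is an illustrative reading rather than the authors' algorithm: the restart rule (recreate a random population whenever the best child is worse than the best parent) follows the wording of the related abstract above, while the tournament selection, uniform crossover, Gaussian mutation, default settings, and the toy objective are all assumptions. In the paper's setting, the objective would be something like the validation error of a support vector machine for a given parameter pair; a simple quadratic stands in for it here.

```python
import random


def restarting_ga(fitness, bounds, pop_size=20, generations=40, seed=0):
    """Minimize `fitness` over a box; restart when a generation gets worse."""
    rng = random.Random(seed)

    def random_individual():
        return [rng.uniform(lo, hi) for lo, hi in bounds]

    def breed(population):
        # Assumed operators: tournament selection, uniform crossover,
        # Gaussian mutation clipped to the bounds.
        children = []
        for _ in range(pop_size):
            p1 = min(rng.sample(population, 2), key=fitness)
            p2 = min(rng.sample(population, 2), key=fitness)
            child = [rng.choice(pair) for pair in zip(p1, p2)]
            for i, (lo, hi) in enumerate(bounds):
                if rng.random() < 0.2:
                    mutated = child[i] + rng.gauss(0.0, 0.1 * (hi - lo))
                    child[i] = max(lo, min(hi, mutated))
            children.append(child)
        return children

    population = [random_individual() for _ in range(pop_size)]
    best = min(population, key=fitness)
    history = [fitness(best)]  # best-so-far value after each generation
    restarts = 0
    for _ in range(generations):
        old_best = min(fitness(ind) for ind in population)
        children = breed(population)
        new_best = min(fitness(ind) for ind in children)
        if new_best > old_best:
            # The new generation is weaker than the old one: restart
            # from a freshly sampled random population.
            population = [random_individual() for _ in range(pop_size)]
            restarts += 1
        else:
            population = children
        candidate = min(population, key=fitness)
        if fitness(candidate) < fitness(best):
            best = candidate
        history.append(fitness(best))
    return best, history, restarts


def toy_objective(params):
    """Hypothetical stand-in for a model's validation error."""
    return (params[0] - 1.0) ** 2 + (params[1] + 2.0) ** 2


best_params, best_history, n_restarts = restarting_ga(
    toy_objective, [(-10.0, 10.0), (-10.0, 10.0)]
)
```

Because the best individual seen so far is kept outside the population, a restart can never lose a good solution; it only replaces the search's starting points.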