Owing to the rapid growth of the World Wide Web, the number of documents accessible via the Internet grows explosively with each passing day. On news portals in particular, documents belonging to categories such as technology, sports, and politics sometimes appear in the wrong category or are placed in a generic category called "others". This is where text categorization (TC), which is generally treated as a supervised learning task, is needed. Although a substantial number of TC studies have been conducted in other languages, the number of studies conducted in Turkish is very limited owing to the poor accessibility and usability of the datasets that have been created. In this paper, a new dataset named TTC-3600, which can be widely used in TC studies of Turkish news and articles, is created. TTC-3600 is a well-documented dataset, and its file formats are compatible with well-known text mining tools. Five classifiers widely used in the field of TC and two feature selection methods are evaluated on TTC-3600. The experimental results indicate that, across all comparisons performed after the pre-processing and feature selection steps, the best accuracy, 91.03%, is obtained with the combination of the Random Forest classifier and an attribute ranking-based feature selection method. The publicly available TTC-3600 dataset and the experimental results of this study can be used by other researchers in comparative experiments.
Data mining has proven useful for knowledge discovery in many areas, ranging from marketing to medicine and from banking to education. This study focuses on data mining and machine learning in the textile industry, since applying them to textile data is considered an emerging interdisciplinary research field. Data mining studies implemented in the textile industry, including classification and clustering techniques and machine learning algorithms, are presented and explained in detail to provide an overview of how clustering and classification techniques can be applied to textile problems for which traditional methods are not useful. This article shows that classification techniques have attracted more interest than clustering techniques in the textile industry. It also shows that the most commonly applied classification methods are artificial neural networks and support vector machines, and that they generally provide high accuracy rates in textile applications. For the clustering task of data mining, the K-means algorithm was the one most generally implemented among the textile studies investigated in this article. We conclude with some remarks on the strengths of data mining techniques for the textile industry and on ways to overcome certain challenges, and we offer some possible directions for further research.
In data mining, using the Naive Bayes classification technique requires deciding how to deal with continuous attributes. Most previous work has solved the problem by using discretization, the normal method, or the kernel method. This study proposes the use of different continuous probability distributions for Naive Bayes classification and explores the probability density functions of various distributions. The experimental results show that the proposed probability distributions can also classify continuous data with potentially high accuracy. In addition, this paper introduces a novel method, named NBC4D, which offers a new approach to classification by applying different distribution types to different attributes. The results (the classification accuracy rates obtained) show that the proposed method, which uses more than one distribution type, performs well on real-world datasets when compared with the use of only one well-known distribution type.
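The idea of assigning a different distribution type to each continuous attribute can be illustrated with a minimal sketch. This is not the authors' NBC4D implementation: the class name `MixedNaiveBayes`, the two supported distributions (normal and exponential), the maximum-likelihood fitting, and the toy data are all assumptions made for illustration, and the distribution for each attribute is supplied by the caller rather than selected automatically.

```python
import math
from collections import defaultdict


def fit_normal(xs):
    """Maximum-likelihood mean and variance (variance floored to avoid log(0))."""
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)
    return mu, max(var, 1e-9)


def logpdf_normal(x, params):
    mu, var = params
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)


def fit_exponential(xs):
    """Maximum-likelihood rate: 1 / sample mean."""
    mean = sum(xs) / len(xs)
    return (1.0 / max(mean, 1e-9),)


def logpdf_exponential(x, params):
    (rate,) = params
    if x < 0:
        return -math.inf  # the exponential density is zero for negative values
    return math.log(rate) - rate * x


# Hypothetical registry: distribution name -> (fit function, log-density function).
DISTRIBUTIONS = {
    "normal": (fit_normal, logpdf_normal),
    "exponential": (fit_exponential, logpdf_exponential),
}


class MixedNaiveBayes:
    """Naive Bayes in which each attribute uses its own distribution type."""

    def __init__(self, dist_per_attribute):
        self.dists = list(dist_per_attribute)

    def fit(self, X, y):
        rows_by_class = defaultdict(list)
        for row, label in zip(X, y):
            rows_by_class[label].append(row)
        n = len(X)
        self.log_prior = {c: math.log(len(rows) / n) for c, rows in rows_by_class.items()}
        # One parameter tuple per (class, attribute), fitted with that attribute's distribution.
        self.params = {
            c: [DISTRIBUTIONS[d][0]([r[j] for r in rows]) for j, d in enumerate(self.dists)]
            for c, rows in rows_by_class.items()
        }
        return self

    def predict_one(self, row):
        def score(c):
            # Log prior plus the sum of per-attribute log densities (the naive independence assumption).
            return self.log_prior[c] + sum(
                DISTRIBUTIONS[d][1](row[j], self.params[c][j])
                for j, d in enumerate(self.dists)
            )
        return max(self.log_prior, key=score)


# Toy data (made up): attribute 0 modeled as normal, attribute 1 as exponential.
X = [[1.0, 0.5], [1.2, 0.7], [0.8, 0.3], [5.0, 4.0], [5.2, 6.0], [4.8, 5.0]]
y = ["a", "a", "a", "b", "b", "b"]
model = MixedNaiveBayes(["normal", "exponential"]).fit(X, y)
```

Calling `model.predict_one([1.1, 0.4])` scores each class by its log prior plus the per-attribute log densities and returns the highest-scoring label. Passing `["normal", "normal"]` instead reduces the sketch to the single-distribution baseline that the abstract compares against.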