Owing to the rapid growth of the World Wide Web, the number of documents accessible via the Internet increases explosively with each passing day. On news portals in particular, documents belonging to categories such as technology, sports and politics sometimes appear in the wrong category or are placed in a generic category called "others". This is where text categorization (TC), which is generally addressed as a supervised learning task, is needed. Although a substantial number of studies have been conducted on TC in other languages, the number of studies conducted in Turkish is very limited owing to the poor accessibility and usability of existing datasets. In this paper, a new dataset named TTC-3600, which can be widely used in studies of TC of Turkish news and articles, is created. TTC-3600 is a well-documented dataset and its file formats are compatible with well-known text mining tools. Five widely used classifiers within the field of TC and two feature selection methods are evaluated on TTC-3600. The experimental results indicate that, across all comparisons performed after the pre-processing and feature selection steps, the best accuracy (91.03%) is obtained by combining the Random Forest classifier with the attribute ranking-based feature selection method. The publicly available TTC-3600 dataset and the experimental results of this study can be utilized in comparative experiments by other researchers.
Bug fixing plays a key role in software quality evaluation. Bug fixing starts with the bug localization step, in which developers use textual bug information to find the location of the source code that contains the bug. Bug localization is a tedious and time-consuming process. Information retrieval requires understanding the programme's goal, coding structure, programming logic and the relevant attributes of the bug. Information retrieval (IR) based bug localization is a retrieval task, where bug reports and source files represent the queries and documents, respectively. In this paper, we propose BugCatcher, a newly developed bug localization method based on a multi-level re-ranking IR technique. We evaluate BugCatcher on three open-source projects with approximately 3400 bugs. Our experiments show that the multi-level re-ranking approach to bug localization is promising. The retrieval performance and accuracy of BugCatcher are better than those of current bug localization tools, and BugCatcher has the best Top N, Mean Average Precision (MAP) and Mean Reciprocal Rank (MRR) values for all datasets.
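The three evaluation measures named in this abstract can be illustrated with a minimal sketch. It is not BugCatcher's code: the function names, the file names and the toy rankings are all hypothetical, and average precision is normalized by the number of known buggy files, which is one common convention and an assumption here.

```python
from typing import Dict, List, Set


def top_n_hit_rate(rankings: Dict[str, List[str]],
                   relevant: Dict[str, Set[str]], n: int) -> float:
    """Fraction of bug reports with at least one buggy file in the first n results."""
    hits = sum(
        1 for bug_id, ranked in rankings.items()
        if any(path in relevant[bug_id] for path in ranked[:n])
    )
    return hits / len(rankings)


def mean_reciprocal_rank(rankings: Dict[str, List[str]],
                         relevant: Dict[str, Set[str]]) -> float:
    """Average of 1/rank of the first buggy file found for each bug report."""
    total = 0.0
    for bug_id, ranked in rankings.items():
        for position, path in enumerate(ranked, start=1):
            if path in relevant[bug_id]:
                total += 1.0 / position
                break  # only the first correct file counts for MRR
    return total / len(rankings)


def mean_average_precision(rankings: Dict[str, List[str]],
                           relevant: Dict[str, Set[str]]) -> float:
    """Mean over bug reports of the average precision of each ranked list."""
    total = 0.0
    for bug_id, ranked in rankings.items():
        hits = 0
        precision_sum = 0.0
        for position, path in enumerate(ranked, start=1):
            if path in relevant[bug_id]:
                hits += 1
                precision_sum += hits / position  # precision at this rank
        if relevant[bug_id]:
            # Assumption: normalize by the number of known buggy files.
            total += precision_sum / len(relevant[bug_id])
    return total / len(rankings)


# Hypothetical toy data: each bug report (query) maps to a ranked list of
# source files (documents) and to the set of files that were actually fixed.
rankings = {
    "bug-1": ["A.java", "B.java", "C.java"],
    "bug-2": ["D.java", "E.java", "F.java"],
}
relevant = {
    "bug-1": {"B.java"},
    "bug-2": {"D.java", "F.java"},
}

print(top_n_hit_rate(rankings, relevant, 1))
print(mean_reciprocal_rank(rankings, relevant))
print(mean_average_precision(rankings, relevant))
```

In this framing, a re-ranking step would only change the order of each list in `rankings`; the measures themselves stay the same, which is what makes them usable for comparing tools.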
Identifying and locating defects in software projects is an important task for improving software quality and reducing software testing effort and cost. In the software fault prediction domain, it is known that 20% of the modules will in general contain about 80% of the faults. To minimize cost and effort, it is therefore important to identify the most error-prone modules precisely and correct them in time. Machine Learning (ML) algorithms are frequently used to locate error-prone modules automatically. The performance of these algorithms, in turn, depends closely on determining the most valuable software metrics. The aim of this research is to develop a Majority Vote based Feature Selection algorithm (MVFS) to identify the most valuable software metrics. The core idea of the method is to identify the most influential software metrics through the collaboration of various feature rankers. To test the efficiency of the proposed method, we used the CM1, JM1, KC1, PC1, Eclipse Equinox and Eclipse JDT datasets and the J48, NB and K-NN (IBk) ML algorithms. The experiments show that the proposed method is able to identify the most significant software metrics, which enhance defect prediction performance.
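The idea of combining several feature rankers by majority vote can be sketched as follows. This is one plausible reading and not the MVFS algorithm as published: the abstract does not specify the voting rule, so the rule used here (each ranker votes for its top-k metrics, and a metric is kept when more than half of the rankers vote for it), the ranker names and the metric names are all assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List


def majority_vote_select(rankings: Dict[str, List[str]], top_k: int) -> List[str]:
    """Keep the metrics that appear in the top_k of more than half of the rankers.

    `rankings` maps a ranker name to its list of metrics, best first.
    """
    votes: Counter = Counter()
    for ranked in rankings.values():
        votes.update(set(ranked[:top_k]))  # one vote per ranker per metric
    threshold = len(rankings) / 2
    selected = [metric for metric, count in votes.items() if count > threshold]
    # Most-voted metrics first; ties broken alphabetically for a stable result.
    return sorted(selected, key=lambda metric: (-votes[metric], metric))


# Hypothetical rankings of five software metrics from three hypothetical rankers.
rankings = {
    "ranker_a": ["loc", "cyclomatic", "branch_count", "halstead_volume", "comment_lines"],
    "ranker_b": ["cyclomatic", "loc", "halstead_volume", "branch_count", "comment_lines"],
    "ranker_c": ["branch_count", "cyclomatic", "comment_lines", "loc", "halstead_volume"],
}

print(majority_vote_select(rankings, top_k=2))
```

With `top_k=2`, `cyclomatic` receives three votes and `loc` two, so both pass the majority threshold, while `branch_count` receives a single vote and is dropped. The selected metrics would then be the only inputs given to a classifier.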
Classification of Scientific Articles Using Text Mining with KNN Algorithm and R Language
ABSTRACT: Text Mining (TM), a subdomain of Data Mining, provides the techniques and methods used to analyze text-based datasets. This study measures how accurately academic articles can be sorted into classes using text mining methods. To this end, the abstracts of selected academic publications were collected from Research Gate, an academic knowledge-sharing network, using self-developed software tools, and a dataset was built from these abstracts. The articles in the dataset fall into two categories: "Materials Science & Engineering" and "Social Sciences & Humanities". The K-Nearest Neighbors (KNN) algorithm was applied to the dataset for classification, using the R language and R Studio. The experiments yielded a classification accuracy (ACC) of 96.67%, identifying the class to which each publication belongs.
Keywords: Text Mining, R, R Studio, KNN, Text Classification.
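A KNN text classifier of the kind described can be sketched in a few lines. The study itself used R and R Studio; the Python below is only an illustration under stated assumptions: whitespace tokenization, raw term counts, cosine similarity and k = 3 are choices made here, and the training sentences are invented toy data, not the study's dataset.

```python
import math
from collections import Counter
from typing import List, Tuple


def vectorize(text: str) -> Counter:
    """Bag-of-words term counts from lower-cased, whitespace-split text."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_predict(train: List[Tuple[str, str]], text: str, k: int = 3) -> str:
    """Label a text by majority vote among its k most similar training texts."""
    query = vectorize(text)
    scored = sorted(train, key=lambda item: cosine(vectorize(item[0]), query),
                    reverse=True)
    top_labels = [label for _, label in scored[:k]]
    return Counter(top_labels).most_common(1)[0][0]


MSE = "Materials Science & Engineering"
SSH = "Social Sciences & Humanities"

# Invented toy abstracts standing in for the real dataset.
train = [
    ("alloy microstructure tensile strength annealing", MSE),
    ("thin film coating corrosion resistance alloy", MSE),
    ("polymer composite tensile strength fatigue", MSE),
    ("survey of social attitudes and cultural identity", SSH),
    ("education policy and social inequality", SSH),
    ("cultural history and philosophy of language", SSH),
]

print(knn_predict(train, "tensile strength annealed alloy"))
print(knn_predict(train, "social inequality and cultural identity"))
```

Accuracy (ACC) in this setting is the share of held-out abstracts whose predicted label matches the true one. A real pipeline would normally add pre-processing such as stop-word removal and term weighting before computing similarities.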