Classification of political news content using the K-Nearest Neighbor algorithm is a process of classifying political news into three more specific subcategories: pilkada (regional elections), UU ORMAS (the Mass Organizations Law), and cabinet reshuffle. The algorithm used in this study is K-Nearest Neighbor, a classification approach that searches the training data for the items most similar to, i.e. at the smallest distance from, the test data. The algorithm was chosen for its simplicity: it assigns the majority category among the K nearest neighbors, for a predetermined value of K. The values of K used in this study are K=3, K=5, K=7, and K=9. The classification system begins with a preprocessing stage: political news entered into the system passes through four preprocessing steps, namely case folding, tokenizing, stopword removal, and stemming. The next stage is term weighting, the process of assigning a value to each term extracted during preprocessing; the TF-IDF algorithm is used for this stage. Once the term weights are obtained, the distance between documents is computed using cosine similarity. The training data are then sorted by the computed distance, and the K closest documents are taken from the sorted list. The aim of this study is a system that implements the KNN algorithm on documents with high similarity. Three tests were conducted, using three different dataset variations and the four values of K. The best accuracy was obtained with K=9, yielding a precision of 100%, a recall of 100%, and an f-measure of 100%. Keywords: classification, K-Nearest Neighbor algorithm, TF-IDF, cosine similarity, confusion matrix.
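A minimal sketch of the pipeline this abstract describes (TF-IDF term weighting, cosine distance, majority vote over the K nearest training documents), using scikit-learn. The toy documents and labels below are hypothetical stand-ins for the paper's political-news dataset, and the Indonesian-specific preprocessing (case folding, tokenizing, stopword removal, stemming) is omitted.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.neighbors import KNeighborsClassifier

    # Hypothetical training documents, one per subcategory.
    train_docs = [
        "regional election campaign candidate votes",
        "mass organization law regulation parliament",
        "cabinet reshuffle minister replaced president",
    ]
    train_labels = ["pilkada", "UU ORMAS", "reshuffle kabinet"]

    vectorizer = TfidfVectorizer()            # term weighting (TF-IDF)
    X_train = vectorizer.fit_transform(train_docs)

    # metric="cosine" ranks neighbors by 1 - cosine similarity;
    # the paper also tried K=5, 7, and 9.
    knn = KNeighborsClassifier(n_neighbors=3, metric="cosine")
    knn.fit(X_train, train_labels)

    X_test = vectorizer.transform(["president announces cabinet reshuffle"])
    print(knn.predict(X_test))                # -> ['reshuffle kabinet']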
K-Means is a well-known clustering algorithm. It generates groups based on degree of similarity. Simplicity of implementation, ease of interpretation, adaptability to sparse data, linear complexity, speed of convergence, and versatility in almost every aspect are notable characteristics of this algorithm. However, it is very sensitive to the process of defining the initial centroids: a bad initial centroid consistently produces poor-quality output. Due to this weakness, it is commonly recommended to perform several runs with different initial centroids and select the initialization that produces clusters with minimum error, yet even this procedure rarely achieves a satisfying result. This paper introduces a new approach to mitigate the initial-centroid problem of the K-Means algorithm. The approach focuses on the centroid-updating stage of K-Means, applying a minimum forest graph to produce better new centroids. Based on information gain and Dunn index values, this approach provided better results than the Forgy method when tested on both well-distributed and noisy datasets. Moreover, in experiments with two-dimensional data, the proposed approach produced consistent cluster memberships in every run, which could not be achieved with the Forgy method.
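A minimal sketch of the multi-run workaround this abstract mentions: run K-Means several times with different random (Forgy-style) initial centroids and keep the run with the smallest within-cluster error. The paper's proposed minimum-forest-graph centroid update itself is not shown here, and the synthetic blobs are hypothetical test data.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

    best = None
    for seed in range(10):
        # init="random" picks random data points as initial centroids;
        # n_init=1 keeps each run a single initialization.
        km = KMeans(n_clusters=3, init="random", n_init=1, random_state=seed)
        km.fit(X)
        # inertia_ is the sum of squared distances to the nearest centroid.
        if best is None or km.inertia_ < best.inertia_:
            best = km

    print(best.inertia_)  # error of the best of ten random initializations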
The development of information technology has produced a large number of digital documents, especially thesis documents, creating the risk that students choose the same, unvaried topics. Thesis documents can be grouped by topic by analyzing their abstracts, and the resulting groups can be visualized to reveal the trend of each topic. Data were retrieved from the University of Jember repository through web scraping, yielding 490 thesis documents from students of the Faculty of Computer Science, University of Jember. The preprocessing stage applies text-mining steps that include cleaning, filtering, stemming, and tokenizing. The weight of each word is then calculated with the Term Frequency - Inverse Document Frequency algorithm, followed by dimension reduction with the Principal Component Analysis algorithm after Z-score normalization. Outliers are removed before the documents are clustered. Document grouping uses the K-Means Clustering method with cosine similarity as the distance measure and the Silhouette Coefficient as the evaluation metric. Tests were carried out with various k values, and the optimal value was obtained at k = 2 with a silhouette value of 0.80. Topic detection then uses the Latent Dirichlet Allocation algorithm on each cluster that has been formed. Each cluster is visualized with a line chart and a linear trend, and analyzed to identify its direction. From the results of the analysis, it can be concluded that the topic of Decision Support System Development is trending down, while the topic of IT Performance Measurement and Forecasting is trending up; work on Decision Support System Development should therefore be reduced so that other topics can emerge.
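A minimal sketch of the grouping pipeline this abstract describes: TF-IDF weighting, Z-score normalization, PCA dimension reduction, K-Means with k = 2, and the Silhouette Coefficient as the quality test. The short abstracts list is a hypothetical stand-in for the 490 scraped thesis abstracts; note that scikit-learn's KMeans is Euclidean, so cosine similarity appears here only in the silhouette evaluation, and the outlier-removal and LDA topic-detection steps are omitted.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    # Hypothetical thesis abstracts.
    abstracts = [
        "decision support system for scholarship selection",
        "decision support system using fuzzy method",
        "forecasting web traffic with neural networks",
        "it performance measurement framework evaluation",
    ]

    X = TfidfVectorizer().fit_transform(abstracts).toarray()
    X = StandardScaler().fit_transform(X)      # Z-score normalization
    X = PCA(n_components=2).fit_transform(X)   # dimension reduction

    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    print(silhouette_score(X, labels, metric="cosine"))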