Graphical/Tabular Abstract
Highlights: Fully online data stream clustering; evolutionary clustering; adaptive radius; time-based summarization; memory for the past status of clusters.
Figure A. Comparison of the clustering quality and run-time complexity of the algorithms on the KDD dataset.
Purpose: The aim of this article is to propose a new data stream clustering algorithm that has an adaptive radius, adapts itself to the evolutionary structure of streaming data, and works in a fully online manner.
Theory and Methods: In this study, a kd-tree is used to form and split clusters, an adaptive radius approach is used to let clusters grow and shrink, and the active/inactive status of clusters is used to adapt to the evolutionary structure of streaming data. All operations are performed online. To create a new cluster, the data that do not belong to any cluster are placed in a kd-tree, and a range search is performed on those data according to two predefined variables: r (the radius of the candidate cluster) and N (the number of data points that must lie within that area). After the clusters are formed, the radius of each cluster can be increased or decreased over time if necessary. Some clusters may be split and some may be merged over time because of the dynamically changing structure of streaming data. Unlike existing methods in the literature, clusters can be inactivated and reactivated, so that clusters formed in the same region at different time intervals are identified with the same cluster label, in accordance with the nature of streaming data. This feature increases the clustering quality of the proposed method. A summarization method that combines a time window and a sliding window is used to support time-based summarization without reducing performance.
Results: To verify the effectiveness of the KD-AR Stream algorithm, it is compared with SE-Stream, DPStream, and CEDAS on a variety of well-known datasets in terms of clustering quality and run-time complexity. The results show that KD-AR Stream outperforms the other algorithms, achieving higher clustering success in a reasonable time, as shown in Fig. A.
Conclusion: The aim of this study is to propose a novel data stream clustering algorithm that adapts to the dynamic structure of streaming data. This aim is achieved through five evolutionary processes: appearance, activation/inactivation, self-evolution, merge, and split. According to the results, the proposed method is very successful in terms of clustering quality and run-time complexity.
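The cluster-creation step described in this abstract (unassigned data placed in a kd-tree, then a range search with radius r and minimum count N) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names build_kdtree, range_search, and find_candidate_cluster, the tuple-based tree layout, and the choice to try each unassigned point as a candidate center are all assumptions made for illustration.

```python
import math

def build_kdtree(points, depth=0):
    """Build a kd-tree as nested tuples (point, left, right); None when empty."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return (points[mid],
            build_kdtree(points[:mid], depth + 1),
            build_kdtree(points[mid + 1:], depth + 1))

def range_search(node, center, r, depth=0, found=None):
    """Collect every stored point within Euclidean distance r of center."""
    if found is None:
        found = []
    if node is None:
        return found
    point, left, right = node
    axis = depth % len(center)
    if math.dist(point, center) <= r:
        found.append(point)
    diff = center[axis] - point[axis]
    # Visit a subtree only if the query ball can reach its side of the split.
    if diff <= r:
        range_search(left, center, r, depth + 1, found)
    if diff >= -r:
        range_search(right, center, r, depth + 1, found)
    return found

def find_candidate_cluster(unassigned, r, n_min):
    """Hypothetical helper: return (center, members) for the first unassigned
    point that has at least n_min points within radius r, else None."""
    tree = build_kdtree(unassigned)
    for p in unassigned:
        members = range_search(tree, p, r)
        if len(members) >= n_min:
            return p, members
    return None

unassigned = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
result = find_candidate_cluster(unassigned, r=0.5, n_min=3)
```

Here the three points near the origin satisfy the r and N condition, while the isolated point at (5.0, 5.0) stays unassigned. The adaptive radius, merge, split, and activation/inactivation steps are outside the scope of this sketch.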
Abstract: In parallel with advances in today's technology, the amount of data transferred to computer environments has reached enormous proportions and keeps growing day by day. Consequently, the methods used to process data are changing as well. In classical clustering approaches, the data is static. In today's world, however, where data arrives very fast, there is a need for applications that can cluster data as it streams and deliver results to the user whenever requested. Demand for data stream clustering approaches that meet this need is therefore increasing day by day, because such approaches read the data only once, are fast, and can adapt themselves to newly arriving data. In other words, results can be produced for the user while the data is still streaming. This study surveys the work done in the field of data stream clustering and aims to shed light for researchers interested in this area.
Supervised machine learning techniques are commonly used in many areas, such as finance, education, healthcare, and engineering, because of their ability to learn from past data. However, such techniques can be very slow if the dataset is high-dimensional, and irrelevant features may reduce classification success. Therefore, feature selection or feature reduction techniques are commonly used to overcome these issues. Information security for both people and networks is also crucial, and it must be ensured without wasting time. Hence, feature selection approaches that make the algorithms faster without reducing classification success are needed. In this study, we compare both the classification success and the run-time performance of state-of-the-art classification algorithms using standard deviation-based feature selection on security datasets. For this purpose, we applied standard deviation-based feature selection to the KDD Cup 99 and Phishing Legitimate datasets to select the most relevant features, and then we ran the selected classification algorithms on the datasets to compare the results. According to the obtained results, while the classification success of all algorithms was satisfactory, Decision Tree (DT) performed best. In terms of speed, Decision Tree, k Nearest Neighbors, and Naïve Bayes (NB) were sufficiently fast, whereas Support Vector Machine (SVM) and Artificial Neural Networks (ANN or NN) were too slow.
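As a rough illustration of standard deviation-based feature selection, the sketch below keeps the features whose standard deviation exceeds a threshold. The abstract does not state the selection rule, so the threshold criterion, the function names select_by_std and project, and the assumption that features have already been scaled to a common range (so that their standard deviations are comparable) are all illustrative choices rather than the study's actual procedure.

```python
from statistics import pstdev

def select_by_std(rows, threshold):
    """Return the indices of columns whose population standard deviation
    exceeds threshold. Assumes features are already on a comparable scale."""
    columns = list(zip(*rows))
    return [j for j, col in enumerate(columns) if pstdev(col) > threshold]

def project(rows, keep):
    """Keep only the selected columns of every row."""
    return [[row[j] for j in keep] for row in rows]

# Toy data: column 0 varies a lot, column 1 is constant, column 2 varies slightly.
rows = [
    [0.0, 1.0, 0.5],
    [1.0, 1.0, 0.5],
    [0.0, 1.0, 0.6],
    [1.0, 1.0, 0.4],
]
keep = select_by_std(rows, threshold=0.1)
reduced = project(rows, keep)
```

With this toy data only column 0 passes the threshold of 0.1, so the classifier would be trained on a single feature. A constant column such as column 1 carries no information for classification, which is the intuition behind discarding low-variance features.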
Advances in technology have enabled people to collect and analyze data produced by many different sources. Data generated by sources such as sensors, mobile devices, and the Internet of Things take the form of data streams, and processing such data to extract useful information is a difficult problem. In clustering, one of the methods most frequently used to analyze streaming data, the data are divided into groups according to their distributions and then analyzed. In this study, two new algorithms were developed for the data stream clustering problem and compared with another method from the literature. Experiments on different datasets showed that the developed algorithms give good results.
The cluster evaluation process is of great importance in machine learning and data mining. Evaluating the quality of the resulting clusters shows how competent a proposed approach or algorithm is. Nevertheless, evaluating the quality of a clustering is still an open issue. Although many cluster validity indices have been proposed, there is a need for new approaches that can measure clustering quality more accurately, because most of the existing approaches measure cluster quality correctly only when the clusters are spherical. However, very few clusters in the real world are spherical. To overcome this issue, this study proposes a new Validity Index for Arbitrary-Shaped Clusters based on kernel density estimation (the VIASCKDE Index). In the VIASCKDE Index, we used the separation and compactness of each data point to support arbitrary-shaped clusters, and utilized kernel density estimation (KDE) to give more weight to the denser areas in the clusters to support cluster compactness. To evaluate the performance of our approach, we compared it to state-of-the-art cluster validity indices. Experimental results have demonstrated that the VIASCKDE Index outperforms the compared indices.
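The idea of combining a per-point compactness and separation score with a density weight can be sketched as follows. This is only an illustration of the general idea and should not be read as the published VIASCKDE formula: taking compactness as the distance to the nearest point in the same cluster, separation as the distance to the nearest point in another cluster, the fixed-bandwidth Gaussian kernel, the global min-max scaling of densities, and the names gaussian_kde_weights and density_weighted_index are all assumptions.

```python
import math

def gaussian_kde_weights(points, bandwidth):
    """Unnormalized Gaussian kernel density at each point, min-max scaled to [0, 1]."""
    dens = [sum(math.exp(-math.dist(p, q) ** 2 / (2 * bandwidth ** 2)) for q in points)
            for p in points]
    lo, hi = min(dens), max(dens)
    return [1.0 if hi == lo else (d - lo) / (hi - lo) for d in dens]

def density_weighted_index(points, labels, bandwidth=1.0):
    """Mean over all points of weight * (b - a) / max(a, b), where a is the
    distance to the nearest same-cluster point and b is the distance to the
    nearest point of any other cluster (both choices are assumptions)."""
    weights = gaussian_kde_weights(points, bandwidth)
    scores = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        same = [math.dist(p, q) for j, (q, l) in enumerate(zip(points, labels))
                if l == lab and j != i]
        other = [math.dist(p, q) for q, l in zip(points, labels) if l != lab]
        if not same or not other:
            scores.append(0.0)  # singleton cluster, or only one cluster in total
            continue
        a, b = min(same), min(other)
        denom = max(a, b)
        scores.append(weights[i] * (b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good = density_weighted_index(points, [0, 0, 0, 1, 1, 1])
bad = density_weighted_index(points, [0, 1, 0, 1, 0, 1])
```

Because each point is scored against its nearest neighbors rather than against a cluster centroid, a score of this kind does not presuppose spherical clusters. In the toy example, the labeling that matches the two visible groups scores higher than the interleaved labeling.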