Introduction: In text mining, a similarity (or distance) measure is the quintessential way to quantify the similarity between two text documents, and it is widely used in Machine Learning (ML) methods, including clustering and classification. ML methods help learn from enormous collections, known as big data [1,2]. In big data, which includes masses of unstructured data, Information Retrieval (IR) is the dominant form of information access [3]. Among ML methods, classification and clustering help discover …
“…In our work, Euclidean performed better than Manhattan when run on Web-KB and Reuters. This is because Manhattan distance is contingent upon the rotation of the coordinate system, which makes it disadvantageous for both document classification and clustering tasks ( Kumar, Chhabra & Kumar, 2014 ). Meanwhile, the Jaccard and extended Jaccard coefficients, Kullback–Leibler divergence (KLD), and the Bhattacharyya coefficient have all been used for several ML and IR tasks, including text clustering and classification ( Amer & Abdalla, 2020 ; Tanimoto, 1957 ; Tata & Patel, 2007 ; Oghbaie & Mohammadi Zanjireh, 2018 ; François, Wertz & Verleysen, 2007 ; D’hondt et al., 2010 ; Li et al., 2017 ; Kullback & Leibler, 1951 ).…”
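The rotation dependence of Manhattan distance mentioned in the snippet above can be demonstrated with a small sketch (plain Python, toy 2-D vectors chosen for illustration, not from the paper): rotating both vectors by the same angle leaves their Euclidean distance unchanged but changes their Manhattan distance.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def rotate(v, theta):
    """Rotate a 2-D vector v by angle theta (radians)."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * v[0] - s * v[1], s * v[0] + c * v[1])

a, b = (1.0, 0.0), (0.0, 1.0)
ar, br = rotate(a, math.pi / 4), rotate(b, math.pi / 4)

# Euclidean distance is rotation-invariant: sqrt(2) before and after.
# Manhattan distance is not: 2.0 before, about sqrt(2) after.
print(euclidean(a, b), euclidean(ar, br))
print(manhattan(a, b), manhattan(ar, br))
```

The same effect holds in the high-dimensional TF-IDF spaces used for documents, which is the stated reason the snippet considers Manhattan disadvantageous there.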
Section: Related Work (mentioning, confidence: 99%)
“…To tackle such challenges, a dozen works in the IR literature have introduced effective similarity measures for text clustering and classification ( Amer & Abdalla, 2020 ; Oghbaie & Mohammadi Zanjireh, 2018 ; Sohangir & Wang, 2017 ; Lin, Jiang & Lee, 2014 ; Shahmirzadi, Lugowski & Younge, 2019 ; Ke, 2017 ; White & Jose, 2004 ; Lakshmi & Baskar, 2021 ; Kotte, Rajavelu & Rajsingh, 2020 ; Thompson, Panchev & Oakes, 2015 ). However, except for Amer & Abdalla (2020) , these studies proposed similarity measures without providing sufficient insight into run-time efficiency.…”
Section: Introduction (mentioning, confidence: 99%)
“…In other words, these studies may introduce measures that are effective yet time-inefficient. Moreover, the measures shown to be effective ( Amer & Abdalla, 2020 ; Oghbaie & Mohammadi Zanjireh, 2018 ; Sohangir & Wang, 2017 ; Lin, Jiang & Lee, 2014 ; Shahmirzadi, Lugowski & Younge, 2019 ; Lakshmi & Baskar, 2021 ; Robertson, 2004 ) suffer from design complexity. Motivated by this, this work aims to address both the inefficiency and the design complexity of those similarity measures.…”
In Information Retrieval (IR), Data Mining (DM), and Machine Learning (ML), similarity measures are widely used for text clustering and classification. The similarity measure is the cornerstone upon which the performance of most DM and ML algorithms depends. Yet the search in the literature for a similarity measure that is both effective and efficient remains unresolved: some recently proposed similarity measures are effective, but have a complex design and suffer from inefficiency. This work therefore develops an effective and efficient similarity measure, with a deliberately simple design, for text-based applications. The proposed measure is driven by the basics of Boolean logic algebra (BLAB-SM) and aims to reach the desired accuracy at a faster run time than recently developed state-of-the-art measures. A comprehensive evaluation is presented using the term frequency–inverse document frequency (TF-IDF) scheme, the K-nearest-neighbor (KNN) classifier, and the K-means clustering algorithm. BLAB-SM is evaluated against seven similarity measures on two popular datasets, Reuters-21 and Web-KB. The experimental results show that BLAB-SM is not only more efficient but also significantly more effective than state-of-the-art similarity measures on both classification and clustering tasks.
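The TF-IDF weighting named in the abstract can be sketched minimally (plain Python; the three-document toy corpus and the `tf_idf` helper are illustrative, not the paper's actual pipeline): each term's weight is its within-document frequency scaled down by how many documents contain it, so corpus-wide terms contribute less to any similarity measure applied afterwards.

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Return one {term: weight} dict per tokenized document,
    using raw-frequency TF and a plain log(N/df) IDF."""
    n = len(corpus)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in corpus for term in set(doc))
    weights = []
    for doc in corpus:
        tf = Counter(doc)
        weights.append({t: (tf[t] / len(doc)) * math.log(n / df[t])
                        for t in tf})
    return weights

docs = [["trade", "oil", "price"], ["oil", "market"], ["sports", "news"]]
w = tf_idf(docs)
# "oil" appears in two of three documents, so it is down-weighted
# relative to "trade", which appears in only one.
```

In the evaluation the abstract describes, vectors like these would then be fed to KNN and K-means under each candidate similarity measure; exact TF and IDF variants differ between implementations, so this is only one common formulation.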
“…The VAT (Bezdek and Hathaway 2002 ) and cVAT (Prasad et al 2019b , a ) use the Euclidean and cosine measures, respectively, when evaluating the clustering tendency of tweets data. Cosine measures similarity between text documents more accurately than Euclidean (Oghbaie and Zanjireh 2018 ). Thus, for the assessment of clusters, cVAT performs better than VAT.…”
“…The most commonly used measure is Euclidean distance, which shows poor results in high-dimensional document clustering. In this paper, novel cosine-based internal and external validity metrics are proposed for evaluating the results of document clustering, taking into account the peculiarity of textual data [18], the closeness between documents [19], and their lexical similarity [20]; cluster classification metrics are also considered to check whether the elements of each cluster are well classified. The effectiveness of the proposed cluster validity metrics is evaluated experimentally with benchmark and Twitter-based datasets.…”
Text data clustering organizes a set of text documents into the desired number of coherent and meaningful sub-clusters. Modeling the text documents in terms of derived topics is a vital task in text data clustering. Each tweet is considered a text document, and various topic models perform modeling of tweets. In existing topic models, the clustering tendency of tweets is initially assessed based on Euclidean dissimilarity features. The cosine metric is more suitable for a more informative assessment, especially in text clustering. Thus, this paper develops a novel cosine-based external and internal validity assessment of cluster tendency to improve the computational efficiency of tweets data clustering. In the experiments, tweets clustering results are evaluated using cluster validity index measures. The experiments show that the cosine-based internal and external validity metrics outperform the others on benchmark and Twitter-based datasets.
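One standard reason cosine suits text better than Euclidean, consistent with the abstracts above, is that it compares direction rather than magnitude: a document and the same document repeated twice have identical term proportions, so their cosine similarity is 1 even though their Euclidean distance is large. A minimal sketch (plain Python, toy term-count vectors of my own choosing):

```python
import math

def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

doc = [3.0, 1.0, 0.0, 2.0]        # term-count vector of a document
doubled = [2 * x for x in doc]    # the same document concatenated with itself

# Same direction: cosine is ~1.0, yet the Euclidean distance is non-zero.
print(cosine(doc, doubled), euclidean(doc, doubled))
```

This length-invariance is what makes cosine the more informative choice for the high-dimensional, sparse TF-IDF vectors typical of tweets and other documents.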