Introduction: In text mining, a similarity (or distance) measure is the quintessential way to quantify the similarity between two text documents, and it is widely used in Machine Learning (ML) methods, including clustering and classification. ML methods help learn from enormous collections, known as big data [1,2]. In big data, which includes masses of unstructured data, Information Retrieval (IR) is the dominant form of information access [3]. Among ML methods, classification and clustering help discover …
“…In our work, Euclidean performed better than Manhattan when run on Web-KB and Reuters. This is because Manhattan distance is contingent upon the rotation of the coordinate system, which makes it disadvantageous for both document classification and clustering tasks ( Kumar, Chhabra & Kumar, 2014 ). Meanwhile, the Jaccard and extended Jaccard coefficients, Kullback–Leibler divergence (KLD), and the Bhattacharyya coefficient have all been used for several ML and IR tasks, including text clustering and classification ( Amer & Abdalla, 2020 ; Tanimoto, 1957 ; Tata & Patel, 2007 ; Oghbaie & Mohammadi Zanjireh, 2018 ; François, Wertz & Verleysen, 2007 ; D’hondt et al., 2010 ; Li et al., 2017 ; Kullback & Leibler, 1951 ).…”
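The rotation dependence of Manhattan distance mentioned in the snippet above can be demonstrated with a small sketch (plain Python, toy 2-D vectors chosen for illustration, not from the paper): rotating both vectors by the same angle leaves their Euclidean distance unchanged but changes their Manhattan distance.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def rotate(v, theta):
    """Rotate a 2-D vector v by angle theta (radians)."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * v[0] - s * v[1], s * v[0] + c * v[1])

a, b = (1.0, 0.0), (0.0, 1.0)
ar, br = rotate(a, math.pi / 4), rotate(b, math.pi / 4)

# Euclidean distance is rotation-invariant: sqrt(2) before and after.
# Manhattan distance is not: 2.0 before, about sqrt(2) after.
print(euclidean(a, b), euclidean(ar, br))
print(manhattan(a, b), manhattan(ar, br))
```

The same effect holds in the high-dimensional TF-IDF spaces used for documents, which is the stated reason the snippet considers Manhattan disadvantageous there.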
Section: Related Work (mentioning, confidence: 99%)
“…To tackle such challenges, a dozen works in the IR literature have introduced effective similarity measures for text clustering and classification ( Amer & Abdalla, 2020 ; Oghbaie & Mohammadi Zanjireh, 2018 ; Sohangir & Wang, 2017 ; Lin, Jiang & Lee, 2014 ; Shahmirzadi, Lugowski & Younge, 2019 ; Ke, 2017 ; White & Jose, 2004 ; Lakshmi & Baskar, 2021 ; Kotte, Rajavelu & Rajsingh, 2020 ; Thompson, Panchev & Oakes, 2015 ). However, except for Amer & Abdalla (2020) , these studies proposed similarity measures without providing sufficient insight into run-time efficiency.…”
Section: Introduction (mentioning, confidence: 99%)
“…In other words, these studies may introduce measures that are effective yet time-inefficient. Moreover, the measures shown to be effective ( Amer & Abdalla, 2020 ; Oghbaie & Mohammadi Zanjireh, 2018 ; Sohangir & Wang, 2017 ; Lin, Jiang & Lee, 2014 ; Shahmirzadi, Lugowski & Younge, 2019 ; Lakshmi & Baskar, 2021 ; Robertson, 2004 ) suffer from design complexity. Motivated by this, this work aims to address both the inefficiency and the design complexity of those similarity measures.…”
In Information Retrieval (IR), Data Mining (DM), and Machine Learning (ML), similarity measures are widely used for text clustering and classification. The similarity measure is the cornerstone upon which the performance of most DM and ML algorithms depends. Yet the search in the literature for a similarity measure that is both effective and efficient remains unresolved: some recently proposed similarity measures are effective, but have a complex design and suffer from inefficiency. This work therefore develops an effective and efficient similarity measure, with a deliberately simple design, for text-based applications. The proposed measure is driven by the basics of Boolean logic algebra (BLAB-SM) and aims to reach the desired accuracy at a faster run time than recently developed state-of-the-art measures. A comprehensive evaluation is presented using the term frequency–inverse document frequency (TF-IDF) scheme, the K-nearest-neighbor (KNN) classifier, and the K-means clustering algorithm. BLAB-SM is evaluated against seven similarity measures on two popular datasets, Reuters-21 and Web-KB. The experimental results show that BLAB-SM is not only more efficient but also significantly more effective than state-of-the-art similarity measures on both classification and clustering tasks.
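The TF-IDF weighting named in the abstract can be sketched minimally (plain Python; the three-document toy corpus and the `tf_idf` helper are illustrative, not the paper's actual pipeline): each term's weight is its within-document frequency scaled down by how many documents contain it, so corpus-wide terms contribute less to any similarity measure applied afterwards.

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Return one {term: weight} dict per tokenized document,
    using raw-frequency TF and a plain log(N/df) IDF."""
    n = len(corpus)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in corpus for term in set(doc))
    weights = []
    for doc in corpus:
        tf = Counter(doc)
        weights.append({t: (tf[t] / len(doc)) * math.log(n / df[t])
                        for t in tf})
    return weights

docs = [["trade", "oil", "price"], ["oil", "market"], ["sports", "news"]]
w = tf_idf(docs)
# "oil" appears in two of three documents, so it is down-weighted
# relative to "trade", which appears in only one.
```

In the evaluation the abstract describes, vectors like these would then be fed to KNN and K-means under each candidate similarity measure; exact TF and IDF variants differ between implementations, so this is only one common formulation.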
“…The VAT (Bezdek and Hathaway 2002 ) and cVAT (Prasad et al 2019b , a ) use the Euclidean and cosine measures, respectively, when evaluating the clustering tendency of tweets data. Cosine measures similarity between text documents more accurately than Euclidean (Oghbaie and Zanjireh 2018 ). Thus, for the assessment of clusters, cVAT performs better than VAT.…”
“…The most commonly used measure is Euclidean distance, which shows poor results in high-dimensional document clustering. In this paper, novel cosine-based internal and external validity metrics are proposed for evaluating the results of document clustering, taking into account the peculiarity of textual data [18], the closeness between documents [19], and their lexical similarity [20]; cluster classification metrics are also considered to check whether the elements of each cluster are well classified. The effectiveness of the proposed cluster validity metrics is evaluated experimentally with benchmark and Twitter-based datasets.…”
Text data clustering organizes a set of text documents into the desired number of coherent and meaningful sub-clusters. Modeling the text documents in terms of derived topics is a vital task in text data clustering. Each tweet is considered a text document, and various topic models perform modeling of tweets. In existing topic models, the clustering tendency of tweets is initially assessed based on Euclidean dissimilarity features. The cosine metric is more suitable for a more informative assessment, especially in text clustering. Thus, this paper develops a novel cosine-based external and internal validity assessment of cluster tendency to improve the computational efficiency of tweets data clustering. In the experiments, tweets clustering results are evaluated using cluster validity index measures. The experiments show that the cosine-based internal and external validity metrics outperform the others on benchmark and Twitter-based datasets.
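One standard reason cosine suits text better than Euclidean, consistent with the abstracts above, is that it compares direction rather than magnitude: a document and the same document repeated twice have identical term proportions, so their cosine similarity is 1 even though their Euclidean distance is large. A minimal sketch (plain Python, toy term-count vectors of my own choosing):

```python
import math

def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

doc = [3.0, 1.0, 0.0, 2.0]        # term-count vector of a document
doubled = [2 * x for x in doc]    # the same document concatenated with itself

# Same direction: cosine is ~1.0, yet the Euclidean distance is non-zero.
print(cosine(doc, doubled), euclidean(doc, doubled))
```

This length-invariance is what makes cosine the more informative choice for the high-dimensional, sparse TF-IDF vectors typical of tweets and other documents.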