2018
DOI: 10.1186/s40537-018-0163-2
Pairwise document similarity measure based on present term set

Abstract: Introduction: In text mining, a similarity (or distance) measure is the quintessential way to calculate the similarity between two text documents, and it is widely used in various Machine Learning (ML) methods, including clustering and classification. ML methods help learn from enormous collections, known as big data [1,2]. In big data, which includes masses of unstructured data, Information Retrieval (IR) is the dominant form of information access [3]. Among ML methods, classification and clustering help discover …

Cited by 39 publications (13 citation statements) · References 37 publications
“…In our work, Euclidean proved more efficient when run on WebKB and Reuters; this is because Manhattan distance is contingent on the rotation of the coordinate system, which makes it disadvantageous for both document classification and clustering tasks ( Kumar, Chhabra & Kumar, 2014 ). Meanwhile, Jaccard, extended Jaccard, Kullback–Leibler divergence (KLD), and the Bhattacharyya coefficient have all been used for several ML and IR tasks, including text clustering and classification ( Amer & Abdalla, 2020 ; Tanimoto, 1957 ; Tata & Patel, 2007 ; Oghbaie & Mohammadi Zanjireh, 2018 ; François, Wertz & Verleysen, 2007 ; D’hondt et al, 2010 ; Li et al, 2017 ; Kullback & Leibler, 1951 ).…”
Section: Related Work
confidence: 99%
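Since the cited paper's theme is similarity over the set of terms present in two documents, the Jaccard measure mentioned above can be sketched as follows. This is a minimal illustration over raw whitespace tokens, not the paper's exact formulation; the function name is ours:

```python
def jaccard(doc_a: str, doc_b: str) -> float:
    """Jaccard similarity over the sets of terms present in two documents."""
    a, b = set(doc_a.lower().split()), set(doc_b.lower().split())
    if not a and not b:
        return 1.0  # two empty documents: treat as identical
    # |intersection| / |union| of the present-term sets
    return len(a & b) / len(a | b)
```

For example, `jaccard("big data mining", "data mining methods")` shares 2 of 4 distinct terms, giving 0.5.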
“…To tackle such challenges, the IR literature has introduced a dozen effective similarity measures for text clustering and classification ( Amer & Abdalla, 2020 ; Oghbaie & Mohammadi Zanjireh, 2018 ; Sohangir & Wang, 2017 ; Lin, Jiang & Lee, 2014 ; Shahmirzadi, Lugowski & Younge, 2019 ; Ke, 2017 ; White & Jose, 2004 ; Lakshmi & Baskar, 2021 ; Kotte, Rajavelu & Rajsingh, 2020 ; Thompson, Panchev & Oakes, 2015 ). However, except for Amer & Abdalla (2020) , these studies proposed similarity measures without providing sufficient insight into run-time efficiency.…”
Section: Introduction
confidence: 99%
“…The VAT (Bezdek and Hathaway 2002 ) and cVAT (Prasad et al 2019b , a ) use the Euclidean and cosine measures when evaluating the clustering tendency of tweet data. Cosine measures similarity features of text documents more accurately than Euclidean (Oghbaie and Zanjireh 2018 ). Thus, for the assessment of clusters, cVAT performs better than VAT.…”
Section: Introduction
confidence: 99%
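The contrast drawn in this citation statement is easy to demonstrate: on term-frequency vectors, cosine similarity is invariant to document length (vector scale), while Euclidean distance is not. A minimal sketch with illustrative function names:

```python
import math

def cosine(u, v):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    return dot / (norm_u * norm_v)

def euclidean(u, v):
    """Euclidean distance between two term-frequency vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

short_doc = [1, 2, 0]   # term frequencies
long_doc = [2, 4, 0]    # same topical profile, document twice as long
# Cosine sees identical direction (similarity 1.0);
# Euclidean sees a gap of sqrt(5), penalizing length alone.
```

This scale-invariance is one common argument for preferring cosine over Euclidean in high-dimensional document spaces.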
“…The most commonly used measure is Euclidean distance, which shows poor results in high-dimensional document clustering. In this paper, novel cosine-based internal and external validity metrics are proposed for internally evaluating the results of document clustering, taking into account the peculiarity of textual data [18], the closeness between documents [19], and lexical similarity [20]; cluster classification metrics are also considered to judge whether the elements of a cluster are well classified. The effectiveness of the proposed cluster validity metrics was experimentally evaluated on benchmark and Twitter-based datasets.…”
Section: Introduction
confidence: 99%
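A cosine-based internal validity metric of the kind this citation statement describes can be illustrated by cluster cohesion: the mean pairwise cosine similarity within a cluster (higher is better). This is a generic sketch of the idea, not the cited paper's specific metric; all names are illustrative:

```python
import math

def cosine(u, v):
    """Cosine similarity between two document vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) *
                  math.sqrt(sum(x * x for x in v)))

def cohesion(cluster):
    """Mean pairwise cosine similarity among documents in one cluster."""
    pairs = [(i, j) for i in range(len(cluster))
                    for j in range(i + 1, len(cluster))]
    return sum(cosine(cluster[i], cluster[j]) for i, j in pairs) / len(pairs)
```

A cluster whose vectors all point in the same direction (e.g. `[[1, 0], [2, 0], [3, 0]]`) has cohesion 1.0 regardless of document lengths, which is exactly why a cosine-based criterion suits high-dimensional text better than a Euclidean one.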