2020
DOI: 10.1186/s40537-020-00344-3

A set theory based similarity measure for text clustering and classification

Abstract: Similarity measures have long been utilized in information retrieval and machine learning domains for multi-purposes including text retrieval, text clustering, text summarization, plagiarism detection, and several other text-processing applications. However, the problem with these measures is that, until recently, there has never been one single measure recorded to be highly effective and efficient at the same time. Thus, the quest for an efficient and effective similarity measure is still an open-ended challe…

Cited by 28 publications (26 citation statements)
References 27 publications
“…Several machine learning techniques have demonstrated a surpassing performance, in the NLP field, to handle the voluminous constantly-piling data and information on the internet. Among these techniques are clustering and classification which are still commonly used in almost all scientific fields, including text mining, information retrieval, web search, pattern recognition, and biomedical based text mining ( Amer & Abdalla, 2020 ; Rachkovskij, 2017 ; Gweon, Schonlau & Steiner, 2019b ; Kanungo et al, 2002 ; Holzinger et al, 2014 ). For example, in Holzinger et al (2014) , a detailed survey in biomedical-based text mining and classification was done, while stressing the importance of involving and improving similarity measures for classification tasks.…”
Section: Introduction (mentioning)
confidence: 99%
“…Generally speaking, in information retrieval, the documents are drawn as vectors in the vector space model (VSM) ( Amer & Abdalla, 2020 ). In each document’s vector, each cell refers to the value of the relative feature that corresponds to the term presence/absence.…”
Section: Introduction (mentioning)
confidence: 99%
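
The vector space model described in the statement above can be illustrated with a minimal Python sketch. It assumes whitespace tokenization and a binary presence/absence weighting; the helper names `build_vocabulary` and `to_binary_vector` are hypothetical and are not taken from the cited papers.

```python
def build_vocabulary(documents):
    """Collect the distinct terms of all documents in a fixed (sorted) order."""
    terms = set()
    for doc in documents:
        terms.update(doc.lower().split())  # assumption: whitespace tokenization
    return sorted(terms)


def to_binary_vector(document, vocabulary):
    """Map a document to a vector with 1 where the term is present, else 0."""
    present = set(document.lower().split())
    return [1 if term in present else 0 for term in vocabulary]


docs = ["text clustering and classification", "text retrieval"]
vocab = build_vocabulary(docs)
vectors = [to_binary_vector(d, vocab) for d in docs]

for doc, vec in zip(docs, vectors):
    print(vec, doc)
```

Each position in a vector corresponds to one vocabulary term, so two documents can be compared cell by cell. Real systems often replace the 0/1 entries with term frequencies or TF-IDF weights.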
“…where l and l´ denote the labels of entities e and e´, W denotes WordNet, syn(l) and ant(l) denote the set of synonyms and antonyms of l, Lin(l, l´) denotes the information theory-based text similarity proposed by [48], and tok(l) denotes the set of words corresponding to the entity label. For example, the set of words corresponding to 'bookTitle' is {{book}, {title}}.…”
Section: A Analysis Of Present Situation (mentioning)
confidence: 99%
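
As a rough illustration of the tok(l) function mentioned in the statement above, the following Python sketch splits an entity label into lowercase words. The splitting rule (camelCase boundaries and non-letter separators) is an assumption made here, since the excerpt does not say how the citing paper tokenizes labels. The result is returned as a flat set of words rather than the set of singleton sets shown in the excerpt.

```python
import re


def tok(label):
    """Split an entity label into a set of lowercase words.

    Assumption: words are runs of lowercase letters, optionally starting
    with one uppercase letter. Acronyms and digits are not handled.
    """
    return {part.lower() for part in re.findall(r"[A-Z]?[a-z]+", label)}


print(tok("bookTitle"))
```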
“…Cosine and Jaccard similarity techniques are the two text-based similarity approach which has been widely incorporated for finding similar text ( Sohangir & Wang, 2017 ; Amer & Abdalla, 2020 ). But these approaches, when applied to question-based corpus for identifying similar question text, lead to the recommendation issues, as discussed in the following subsections.…”
Section: Guiding the Learner To The Probable Correct Question (mentioning)
confidence: 99%
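
For reference, the two measures named in the statement above can be sketched in a few lines of Python. This is the textbook form of Jaccard similarity (on term sets) and cosine similarity (on term-frequency vectors), not the measure proposed in the cited article. The whitespace tokenization, the sample questions, and the convention of returning 0.0 for empty inputs are assumptions made for this example.

```python
import math
from collections import Counter


def jaccard_similarity(tokens_a, tokens_b):
    """Size of the shared term set divided by the size of the combined term set."""
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 0.0  # assumption: two empty texts are treated as dissimilar
    return len(a & b) / len(a | b)


def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between the two term-frequency vectors."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an empty text has no similarity to anything
    return dot / (norm_a * norm_b)


q1 = "how to sort a list".split()
q2 = "how to reverse a list".split()

jac = jaccard_similarity(q1, q2)
cos = cosine_similarity(q1, q2)
print(f"Jaccard: {jac:.3f}, cosine: {cos:.3f}")
```

The two sample questions share four of their five words yet ask for different operations, which gives a sense of why purely lexical measures can rank distinct questions as similar.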