A set theory based similarity measure for text clustering and classification

Amer, Ali A.; Abdalla, Hassan I.

doi:10.1186/s40537-020-00344-3

Cited by 28 publications

(26 citation statements)

References 27 publications


“…Several machine learning techniques have demonstrated a surpassing performance, in the NLP field, to handle the voluminous constantly-piling data and information on the internet. Among these techniques are clustering and classification which are still commonly used in almost all scientific fields, including text mining, information retrieval, web search, pattern recognition, and biomedical based text mining ( Amer & Abdalla, 2020 ; Rachkovskij, 2017 ; Gweon, Schonlau & Steiner, 2019b ; Kanungo et al, 2002 ; Holzinger et al, 2014 ). For example, in Holzinger et al (2014) , a detailed survey in biomedical-based text mining and classification was done, while stressing the importance of involving and improving similarity measures for classification tasks.…”

Section: Introduction (mentioning)

confidence: 99%


Boolean logic algebra driven similarity measure for text based applications

Abdalla

Amer

2021

PeerJ Computer Science

Self Cite


In Information Retrieval (IR), Data Mining (DM), and Machine Learning (ML), similarity measures have been widely used for text clustering and classification. The similarity measure is the cornerstone upon which the performance of most DM and ML algorithms is completely dependent. Thus, till now, the endeavor in literature for an effective and efficient similarity measure is still immature. Some recently-proposed similarity measures were effective, but have a complex design and suffer from inefficiencies. This work, therefore, develops an effective and efficient similarity measure of a simplistic design for text-based applications. The measure developed in this work is driven by Boolean logic algebra basics (BLAB-SM), which aims at effectively reaching the desired accuracy at the fastest run time as compared to the recently developed state-of-the-art measures. Using the term frequency–inverse document frequency (TF-IDF) schema, the K-nearest neighbor (KNN), and the K-means clustering algorithm, a comprehensive evaluation is presented. The evaluation has been experimentally performed for BLAB-SM against seven similarity measures on two most-popular datasets, Reuters-21 and Web-KB. The experimental results illustrate that BLAB-SM is not only more efficient but also significantly more effective than state-of-the-art similarity measures on both classification and clustering tasks.


Section: Introduction (mentioning)

confidence: 99%

“…Generally speaking, in information retrieval, the documents are drawn as vectors in the vector space model (VSM) ( Amer & Abdalla, 2020 ). In each document’s vector, each cell refers to the value of the relative feature that corresponds to the term presence/absence.…”

Section: Introduction (mentioning)

confidence: 99%
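
The vector space model described in the statement above can be illustrated with a minimal sketch. It assumes whitespace tokenization and binary presence/absence cells; the function names `build_vocabulary` and `to_binary_vector` and the two sample documents are hypothetical and do not come from the cited papers.

```python
def build_vocabulary(documents):
    """Return a sorted list of the distinct terms across all documents."""
    terms = set()
    for text in documents:
        terms.update(text.lower().split())
    return sorted(terms)


def to_binary_vector(text, vocabulary):
    """Map a document to a vector with 1 where a term is present, else 0."""
    present = set(text.lower().split())
    return [1 if term in present else 0 for term in vocabulary]


# Illustrative documents (assumed, not taken from any dataset in the papers).
docs = ["text clustering and classification", "text similarity measure"]
vocab = build_vocabulary(docs)
vectors = [to_binary_vector(d, vocab) for d in docs]

for doc, vec in zip(docs, vectors):
    print(vec, doc)
```

Each position in a vector corresponds to one vocabulary term, so two documents can be compared cell by cell. A weighting scheme such as TF-IDF would replace the 0/1 cells with real-valued weights.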

Boolean logic algebra driven similarity measure for text based applications

Abdalla

Amer

2021

PeerJ Computer Science

Self Cite


“…where l and l′ denote the labels of entities e and e′, W denotes WordNet, syn(l) and ant(l) denote the set of synonyms and antonyms of l, Lin(l, l′) denotes the information theory-based text similarity proposed by [48], and tok(l) denotes the set of words corresponding to the entity label. For example, the set of words corresponding to 'bookTitle' is {{book}, {title}}.…”

Section: A Analysis Of Present Situation (mentioning)

confidence: 99%

Ontology Matching: State of the Art, Future Challenges, and Thinking Based on Utilized Information

et al. 2021


Information used in existing ontology matching solutions are usually grouped into four categories: lexical information, structural information, semantic information, and external information, respectively. By summarizing and analyzing the approaches for utilizing the same kind of information, this paper finds that lexical information is mainly analyzed based on text and dictionary similarity. Similarly, structural information and semantic information are mainly analyzed based on graph structure and reasoner, respectively. The approaches for aggregating information analysis results are discussed. Challenges in the analysis of various types of information for existing ontology matching solutions are also described, and insights into directions for future research are provided.


“…Cosine and Jaccard similarity techniques are the two text-based similarity approach which has been widely incorporated for finding similar text ( Sohangir & Wang, 2017 ; Amer & Abdalla, 2020 ). But these approaches, when applied to question-based corpus for identifying similar question text, lead to the recommendation issues, as discussed in the following subsections.…”

Section: Guiding the Learner To The Probable Correct Question (mentioning)

confidence: 99%
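
For readers unfamiliar with the two measures named in the statement above, the following is a minimal sketch of how cosine and Jaccard similarity are commonly computed on tokenized text. It assumes whitespace tokenization and raw term counts; the function names and the two sample questions are hypothetical. It only shows how the scores are computed and does not reproduce the recommendation issues the citing paper discusses.

```python
import math
from collections import Counter


def jaccard_similarity(a, b):
    """Size of the shared term set divided by the size of the combined term set."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0  # convention chosen here: two empty texts are identical
    return len(set_a & set_b) / len(set_a | set_b)


def cosine_similarity(a, b):
    """Cosine of the angle between the two term-count vectors."""
    counts_a, counts_b = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Illustrative questions (assumed, not taken from the paper's corpus).
q1 = "what is a class in java"
q2 = "what is an interface in java"
print("Jaccard:", jaccard_similarity(q1, q2))
print("Cosine:", cosine_similarity(q1, q2))
```

Both scores depend only on term overlap, so two questions that ask about different concepts can still score fairly high when they share common words.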

Learner question’s correctness assessment and a guided correction method: enhancing the user experience in an interactive online learning system

Pal

Pramanik

Maity

et al. 2021

PeerJ Computer Science


In an interactive online learning system (OLS), it is crucial for the learners to form the questions correctly in order to be provided or recommended appropriate learning materials. The incorrect question formation may lead the OLS to be confused, resulting in providing or recommending inappropriate study materials, which, in turn, affects the learning quality and experience and learner satisfaction. In this paper, we propose a novel method to assess the correctness of the learner's question in terms of syntax and semantics. Assessing the learner’s query precisely will improve the performance of the recommendation. A tri-gram language model is built, and trained and tested on corpora of 2,533 and 634 questions on Java, respectively, collected from books, blogs, websites, and university exam papers. The proposed method has exhibited 92% accuracy in identifying a question as correct or incorrect. Furthermore, in case the learner's input question is not correct, we propose an additional framework to guide the learner leading to a correct question that closely matches her intended question. For recommending correct questions, soft cosine based similarity is used. The proposed framework is tested on a group of learners' real-time questions and observed to accomplish 85% accuracy.

