State of the art document clustering algorithms based on semantic similarity

Jacksi, Karwan; Salih, Niyaz

doi:10.26555/jifo.v14i2.a17513

Cited by 19 publications

(9 citation statements)

References 35 publications

(65 reference statements)


“…Another close problem is to find clusters of related news articles such that a cluster has a high internal coherence (i.e., having articles from the same topic or word distribution), but different from other clusters (Bisandu, Prasad & Liman, 2018; Salih & Jacksi, 2020; Khan et al., 2018). While articles within the cluster might be candidate background to each other, the cluster size is big, and selecting specific articles from the cluster to present to the reader of a specific query article is again a challenging problem.…”

Section: Background Linking Problem (mentioning)

confidence: 99%

Unsupervised query reduction for efficient yet effective news background linking

Essam, Elsayed

2023

PeerJ Computer Science


In this article, we study efficient techniques to tackle the news background linking problem, in which an online reader seeks background knowledge about a given article to better understand its context. Recently, this problem attracted many researchers, especially in the Text Retrieval Conference (TREC) community. Surprisingly, the most effective method to date uses the entire input news article as a search query in an ad-hoc retrieval approach to retrieve the background links. In a scenario where the lookup for background links is performed online, this method becomes inefficient, especially if the search scope is big such as the Web, due to the relatively long generated query, which results in a long response time. In this work, we evaluate different unsupervised approaches for reducing the input news article to a much shorter, hence efficient, search query, while maintaining the retrieval effectiveness. We conducted several experiments using the Washington Post dataset, released specifically for the news background linking problem. Our results show that a simple statistical analysis of the article using a recent keyword extraction technique reaches an average of 6.2× speedup in query response time over the full article approach, with no significant difference in effectiveness. Moreover, we found that further reduction of the search terms can be achieved by eliminating relatively low TF-IDF values from the search queries, yielding even more efficient retrieval of 13.3× speedup, while still maintaining the retrieval effectiveness. This makes our approach more suitable for practical online scenarios. Our study is the first to address the efficiency of news background linking systems. We, therefore, release our source code to promote research in that direction.
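As an illustration of the term-pruning idea described in the abstract above, the following minimal Python sketch ranks an article's terms by TF-IDF and keeps only the highest-scoring fraction as the search query. It is not the authors' released code: the function names `tfidf_scores` and `reduce_query`, the smoothed IDF formula, and the `keep_ratio` parameter are all assumptions made for this example.

```python
import math
from collections import Counter


def tfidf_scores(article_tokens, corpus_tokens):
    """Return {term: TF-IDF score} for one article against a tokenized corpus."""
    n_docs = len(corpus_tokens)
    doc_freq = Counter()
    for doc in corpus_tokens:
        doc_freq.update(set(doc))  # count each term once per document
    term_freq = Counter(article_tokens)
    scores = {}
    for term, tf in term_freq.items():
        # Smoothed IDF: an assumption, the abstract does not state the weighting.
        idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1
        scores[term] = (tf / len(article_tokens)) * idf
    return scores


def reduce_query(article_tokens, corpus_tokens, keep_ratio=0.5):
    """Keep the top `keep_ratio` fraction of terms, ranked by TF-IDF."""
    scores = tfidf_scores(article_tokens, corpus_tokens)
    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    keep = max(1, int(len(ranked) * keep_ratio))
    return ranked[:keep]


# Hypothetical toy corpus; the last document plays the role of the query article.
corpus = [
    ["the", "election", "results"],
    ["the", "storm", "damage"],
    ["the", "election", "recount", "recount"],
]
short_query = reduce_query(corpus[2], corpus, keep_ratio=0.5)
```

In this toy setup the common term "the" receives the lowest score and is dropped first, which is the behavior the pruning step relies on. A real system would send the shortened term list to a retrieval engine in place of the full article text.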


“…The effectiveness of word embeddings in projecting keyword relationships as well as their performance in topic modeling [32,33], clustering [34], and document classification [35,36] consistently inspire researchers to propose methodologies that summarize word vectors into detailed structures reflected on textual topics [37,38]. In general, the effect, evaluation, and selection of the different techniques in cluster analysis and topic extraction vary in relevant experiments [39,40]. Some of the standard methodologies that utilize word vectors for topic extraction are the Gaussian Mixture Models (GMM) [32,41], Density-Based Spatial Clustering of Applications with Noise (DBSCAN) [13,42], hard and Fuzzy K-Means [34,43], and more complex approaches that are based on textual similarities [40].…”

Section: Cluster Analysis and Word Embeddings (mentioning)

confidence: 99%

“…In general, the effect, evaluation, and selection of the different techniques in cluster analysis and topic extraction vary in relevant experiments [39,40]. Some of the standard methodologies that utilize word vectors for topic extraction are the Gaussian Mixture Models (GMM) [32,41], Density-Based Spatial Clustering of Applications with Noise (DBSCAN) [13,42], hard and Fuzzy K-Means [34,43], and more complex approaches that are based on textual similarities [40].…”

Section: Cluster Analysis and Word Embeddings (mentioning)

confidence: 99%

Exploitation of Vulnerabilities: A Topic-Based Machine Learning Framework for Explaining and Predicting Exploitation

2023


Security vulnerabilities constitute one of the most important weaknesses of hardware and software security that can cause severe damage to systems, applications, and users. As a result, software vendors should prioritize the most dangerous and impactful security vulnerabilities by developing appropriate countermeasures. As we acknowledge the importance of vulnerability prioritization, in the present study, we propose a framework that maps newly disclosed vulnerabilities with topic distributions, via word clustering, and further predicts whether this new entry will be associated with a potential exploit Proof Of Concept (POC). We also provide insights on the current most exploitable weaknesses and products through a Generalized Linear Model (GLM) that links the topic memberships of vulnerabilities with exploit indicators, thus distinguishing five topics that are associated with relatively frequent recent exploits. Our experiments show that the proposed framework can outperform two baseline topic modeling algorithms in terms of topic coherence by improving LDA models by up to 55%. In terms of classification performance, the conducted experiments—on a quite balanced dataset (57% negative observations, 43% positive observations)—indicate that the vulnerability descriptions can be used as exclusive features in assessing the exploitability of vulnerabilities, as the “best” model achieves accuracy close to 87%. Overall, our study contributes to enabling the prioritization of vulnerabilities by providing guidelines on the relations between the textual details of a weakness and the potential application/system exploits.
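The word-clustering step mentioned in this abstract can be illustrated with hard K-means, one of the methods named in the citation statement from this paper. The sketch below is a plain-Python version under stated assumptions: the two-dimensional "word vectors" are invented toy values, the `kmeans` and `sq_dist` helpers are hypothetical names, and the centroids are initialized from the first `k` points for determinism. It does not reproduce the paper's actual pipeline.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20):
    """Hard K-means: returns (labels, centroids) for a list of vectors."""
    # Deterministic initialization from the first k points (an assumption).
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


# Invented 2-D vectors standing in for real word embeddings.
words = ["exploit", "patch", "malware", "update"]
vectors = [(0.9, 0.1), (0.1, 0.9), (0.8, 0.2), (0.2, 0.8)]
labels, centroids = kmeans(vectors, k=2)
topics = {}
for word, label in zip(words, labels):
    topics.setdefault(label, []).append(word)
```

Each resulting group of words can then be read as a rough "topic". With real embeddings, the number of clusters and the initialization strategy would need to be chosen with more care than in this toy case.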


“…This process is carried out on a group of things that have been gathered together [11][12][13]. Clustering is a technique that can be used to organize data structures into a number of groups that are incompatible with one another and are referred to collectively as clusters [14][15][16]. Clustering is a strategy that may be employed.…”

Section: Introduction (mentioning)

confidence: 99%

Document Clustering in the Age of Big Data: Incorporating Semantic Information for Improved Results

Haji, Al-Zebari, Şengür et al.

2023

JASTT


There has been a meteoric rise in the total amount of digital texts as a direct result of the proliferation of internet access. As a direct result of this, document clustering has evolved into a crucial method that must be used in order to successfully extract relevant information from big document collections. When employing the document clustering approach, documents are automatically sorted into groups whose members have a high degree of similarity to one another. These groups are created by applying the document clustering technique. Because they do not take into account the semantic linkages that exist between the texts, traditional clustering approaches are unable to provide an acceptable description of a collection of texts. Document clusters, in which texts are ordered according to their meaning rather than their use of keywords, have been extensively utilized as a means of overcoming these challenges as a result of the incorporation of semantic information. This has been possible as a result of the fact that document clusters can group together related texts. In this investigation, we looked at a total of 27 distinct papers that were published over the previous five years and categorized the documents based on the semantic similarities that existed between the various pieces. A detailed literature evaluation is included for each and every one of the publications that were selected for further consideration. Comparative research is carried out on a wide variety of evaluation strategies, including algorithms, similarity metrics, instruments, and processes. Following that, there is a drawn-out discussion that analyzes the similarities and differences between the activities.
