What is this Cluster about? Explaining textual clusters by extracting relevant keywords

Penta, Antonio; Pal, Anandita

doi:10.1016/j.knosys.2021.107342

Cited by 4 publications

(2 citation statements)

References 25 publications

“…Specifically, this step partitions aspects into K clusters based on their semantic similarity using Google's pretrained Word2vec model and the K-means clustering method. K-means is a widely used distance/centroid-based algorithm, where distances are determined in order to allocate a point to a cluster [53]. The K-means algorithm associates each cluster with a centroid and aims to minimise the sum of the distances between the cluster centroid and the points assigned to the cluster.…”

Section: Extracting and Clustering Aspects (mentioning)

confidence: 99%

To Cluster or Not to Cluster: The Impact of Clustering on the Performance of Aspect-Based Collaborative Filtering

et al. 2023

Collaborative filtering (CF) is one of the most widely utilised approaches in recommendation techniques. It suggests items to users based on the ratings of other users who share their preferences. Thus, one of the aims of CF is to find reliable neighbours. Typically, CF produces a sparse user-item rating matrix when relying only on the ratings to identify the precise neighbours, resulting in poor performance. User reviews can be essential in overcoming those situations because of the diverse elements available in reviews. The most popular element is aspects, which can provide a fine-grained analysis of users' behaviours, thus improving personalised recommendations. However, increasing the number of aspects also results in sparsity and may therefore degrade the recommendation performance. As a result, clustering of aspects may lessen this sparsity, but it is not yet clear how much this would affect the performance of CF systems. This study proposes a CF approach based on aspect clustering that addresses the above issue in terms of rating prediction. The approach aims to reduce the sparseness in the multi-criteria rating matrix by grouping aspects into clusters based on their semantic similarity, which will be less expensive and require less memory to discover the neighbourhood set. Our approach extracts aspects and represents them using Google's pre-trained Word2vec model. Then, aspects are organised into clusters using the K-means clustering algorithm. Multi-dimensional Euclidean distance is used as a similarity measure for finding the appropriate neighbours, and predicted ratings of unseen items are then made using the kNN algorithm. This study also identifies the number of aspects that significantly impacts CF performance. Experiments are carried out using a real large-scale dataset: the Amazon movie dataset. Evaluation is also performed by comparing CF performance of the proposed approach with three different baseline approaches. Results show that the proposed approach improves CF performance compared to other approaches in terms of three predictive accuracy metrics.
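The K-means step described in the citation statement and abstract above can be sketched as follows. This is a minimal illustration, not the citing authors' implementation: the toy 2-D vectors stand in for Word2vec aspect embeddings, the function name `kmeans` and its parameters are hypothetical, and the deterministic initialisation is a simplification (practical implementations usually use random or k-means++ seeding). Note also that the standard formulation minimises the sum of squared Euclidean distances to the centroids.

```python
import math


def kmeans(points, k, iterations=20):
    """Lloyd's algorithm: alternate point assignment and centroid update."""
    # Assumption for illustration: evenly spaced points seed the centroids,
    # which keeps the sketch deterministic.
    centroids = [points[i * len(points) // k] for i in range(k)]

    def nearest(p):
        # Index of the centroid closest to p in Euclidean distance.
        return min(range(k), key=lambda i: math.dist(p, centroids[i]))

    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p)].append(p)

        # Update step: each centroid moves to the mean of its members.
        # An empty cluster keeps its previous centroid.
        updated = [
            tuple(sum(axis) / len(members) for axis in zip(*members))
            if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if updated == centroids:  # converged, nothing moved
            break
        centroids = updated

    labels = [nearest(p) for p in points]
    return centroids, labels


# Toy stand-ins for aspect embeddings: two well-separated groups.
toy_vectors = [(0.0, 0.1), (0.1, 0.0), (0.0, 0.0),
               (5.0, 5.1), (5.1, 5.0), (5.0, 5.0)]
centroids, labels = kmeans(toy_vectors, k=2)
print(labels)
```

With real embeddings the same loop applies unchanged, since `math.dist` works for vectors of any dimension.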

“…Common approaches to convert text to vector representation include bag-of-words methods and TF-IDF (term frequency-inverse document frequency), word embedding models such as word2vec (Mikolov et al., 2013) and Global Vectors for Word Representation (GloVe) (Pennington et al., 2014), and transformer models like Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2019). Summarizing and interpreting output document clusters can also be difficult due to the high-dimensionality of text-based data and is an active area of research (Afzali & Kumar, 2019; Penta & Pal, 2021).…”

Section: Introduction (mentioning)

confidence: 99%

Research proposal content extraction using natural language processing and semi-supervised clustering: A demonstration and comparative analysis

Knisely, Pavliscsak 2023, Scientometrics

Funding institutions often solicit text-based research proposals to evaluate potential recipients. Leveraging the information contained in these documents could help institutions understand the supply of research within their domain. In this work, an end-to-end methodology for semi-supervised document clustering is introduced to partially automate classification of research proposals based on thematic areas of interest. The methodology consists of three stages: (1) manual annotation of a document sample; (2) semi-supervised clustering of documents; (3) evaluation of cluster results using quantitative metrics and qualitative ratings (coherence, relevance, distinctiveness) by experts. The methodology is described in detail to encourage replication and is demonstrated on a real-world data set. This demonstration sought to categorize proposals submitted to the US Army Telemedicine and Advanced Technology Research Center (TATRC) related to technological innovations in military medicine. A comparative analysis of method features was performed, including unsupervised vs. semi-supervised clustering, several document vectorization techniques, and several cluster result selection strategies. Outcomes suggest that pretrained Bidirectional Encoder Representations from Transformers (BERT) embeddings were better suited for the task than older text embedding techniques. When comparing expert ratings between algorithms, semi-supervised clustering produced coherence ratings ~25% better on average compared to standard unsupervised clustering, with negligible differences in cluster distinctiveness. Last, it was shown that a cluster result selection strategy that balances internal and external validity produced ideal results. With further refinement, this methodological framework shows promise as a useful analytical tool for institutions to unlock hidden insights from untapped archives and similar administrative document repositories.
Supplementary Information: The online version contains supplementary material available at 10.1007/s11192-023-04689-3.
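Of the vectorization techniques named in the citation statement above, TF-IDF is the simplest to show in full. The sketch below is illustrative only and is not taken from the cited work: the function name `tfidf`, the toy documents, and the unsmoothed textbook weighting (term frequency times log(N / document frequency)) are all assumptions, and library implementations typically apply smoothing and normalisation on top of this.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return the sorted vocabulary and one TF-IDF vector per tokenised document."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: number of documents containing each token.
    df = Counter(tok for doc in docs for tok in set(doc))

    vectors = []
    for doc in docs:
        counts = Counter(doc)
        length = len(doc) or 1  # guard against empty documents
        vectors.append([
            (counts[tok] / length) * math.log(n_docs / df[tok])
            for tok in vocab
        ])
    return vocab, vectors


# Hypothetical toy corpus, already tokenised.
docs = [["cluster", "keyword", "cluster"], ["cluster", "proposal"]]
vocab, vectors = tfidf(docs)
print(vocab)
print(vectors)
```

A token that appears in every document, such as "cluster" here, receives weight zero, which is why TF-IDF tends to highlight terms that distinguish one document (or cluster) from the rest.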

Towards ML Explainability with Rough Sets, Clustering, and Dimensionality Reduction

Grzegorowski, Janusz, Śliwa et al. 2023, Lecture Notes in Computer Science
