2021 · DOI: 10.2478/jdis-2021-0024
Embedding-based Detection and Extraction of Research Topics from Academic Documents Using Deep Clustering

Abstract: Purpose: Detecting research fields or topics and understanding their dynamics helps the scientific community make decisions about establishing scientific fields, and also supports better collaboration with governments and businesses. This study aims to investigate the development of research fields over time, framing it as a topic detection problem. Design/methodology/approach: To achieve the obj…

Cited by 6 publications (3 citation statements) · References 34 publications
“…Doc2Vec [10] is a document-level extension of Word2Vec that takes word order into account. Comparative studies by Radu et al. (2020) [13] and Vahidnia et al. (2021) [14] have shown that Doc2Vec embeddings combined with off-the-shelf clustering algorithms such as K-means and DBSCAN [22], as well as deep embedded clustering [15], improve the accuracy of document clustering on scientific publications and outperform classical bag-of-words representations. However, Word2Vec and Doc2Vec generate only one vector per word, which fails to capture different senses of a word: the word “bank”, for example, receives the same vector whether it appears in “river bank” or “commercial bank”.…”
Section: Pretrained Language Models and Applications
confidence: 99%
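The statement above contrasts static word vectors with contextual ones. The sketch below is not taken from the cited papers; it is a minimal illustration, assuming the Hugging Face `transformers` package and the public `bert-base-uncased` checkpoint (an illustrative model choice), of how a contextual model assigns different vectors to “bank” in its river sense and its financial sense, whereas a static Word2Vec/Doc2Vec vocabulary would give both occurrences the same vector.

```python
# Minimal sketch (not from the cited papers): contrasting a static word vector
# with contextual BERT vectors for the word "bank" in two different senses.
# Assumes the `transformers` package and the public `bert-base-uncased`
# checkpoint; the model choice is illustrative only.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def bank_vector(sentence: str) -> torch.Tensor:
    """Return the contextual hidden state of the token 'bank' in `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    idx = tokens.index("bank")  # 'bank' is a single WordPiece in this vocabulary
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # shape (1, seq_len, 768)
    return hidden[0, idx]

v_river = bank_vector("They sat on the river bank and watched the water.")
v_money = bank_vector("She deposited the cheque at the commercial bank.")

# A static Word2Vec/Doc2Vec model would assign 'bank' the same vector in both
# sentences (cosine similarity 1.0); BERT's contextual vectors differ.
sim = torch.nn.functional.cosine_similarity(v_river, v_money, dim=0)
print(f"cosine similarity between the two 'bank' vectors: {sim.item():.3f}")
```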
“…There are a number of document embedding methods, ranging from bag-of-words (BoW), Word2Vec [9], and Doc2Vec [10] to the most recent transformer-based models such as BERT (Bidirectional Encoder Representations from Transformers) [11] and the GPT-3 similarity embeddings [12]. Radu et al. (2020) [13] and Vahidnia et al. (2021) [14] experimented with Doc2Vec embeddings and off-the-shelf clustering algorithms, such as K-means, hierarchical agglomerative clustering, and deep embedded clustering [15], on publication abstracts, and then used top TF-IDF terms to label each cluster. Their results showed that Doc2Vec embeddings improve the accuracy of the clustering algorithms.…”
Section: Introduction
confidence: 99%
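The statement above describes a pipeline of Doc2Vec embeddings, off-the-shelf clustering, and TF-IDF cluster labelling. The following is a minimal sketch of that general pipeline, not the cited authors' exact setup: the toy corpus, `vector_size`, and `n_clusters` are placeholders rather than the settings used in the cited studies.

```python
# Minimal sketch (not the cited authors' exact pipeline): embed abstracts with
# Doc2Vec, cluster with K-means, and label each cluster by its top TF-IDF terms.
# The corpus, vector_size, and n_clusters are illustrative placeholders.
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

abstracts = [
    "deep learning for image classification and object detection",
    "convolutional neural networks applied to medical imaging",
    "graph based ranking of scientific publications and citations",
    "citation network analysis for measuring research impact",
]

# 1) Document embeddings with Doc2Vec.
tagged = [TaggedDocument(words=a.split(), tags=[i]) for i, a in enumerate(abstracts)]
d2v = Doc2Vec(tagged, vector_size=50, window=5, min_count=1, epochs=40, seed=1)
X = np.vstack([d2v.dv[i] for i in range(len(abstracts))])

# 2) Off-the-shelf clustering of the document embeddings.
n_clusters = 2
km = KMeans(n_clusters=n_clusters, n_init=10, random_state=1).fit(X)

# 3) Label each cluster with its highest-scoring TF-IDF terms.
tfidf = TfidfVectorizer()
T = tfidf.fit_transform(abstracts)
terms = np.array(tfidf.get_feature_names_out())
for c in range(n_clusters):
    rows = np.where(km.labels_ == c)[0]
    scores = np.asarray(T[rows].mean(axis=0)).ravel()
    top = terms[scores.argsort()[::-1][:3]]
    print(f"cluster {c}: {', '.join(top)}")
```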
“…The last paper, “Embedding-based Detection and Extraction of Research Topics from Academic Documents Using Deep Clustering” (Vahidnia, Abbasi, & Abbass, 2021), proposed a modified deep clustering method to detect research trends from the abstracts and titles of academic documents. The experimental results show that the modified DEC, in conjunction with Doc2Vec, can outperform other methods in the clustering task.…”
Section: Journal of Data and Information Science
confidence: 99%
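The statement above refers to the paper's modified deep embedded clustering (DEC). The sketch below shows generic DEC in the spirit of Xie et al. (2016) [15], not the paper's modification: an autoencoder is pretrained on reconstruction, cluster centres are initialised with K-means in the latent space, and the encoder and centres are then refined by minimising the KL divergence between soft assignments and a sharpened target distribution. The synthetic inputs, layer sizes, and epoch counts are placeholders standing in for Doc2Vec vectors of abstracts and titles.

```python
# Minimal sketch of generic deep embedded clustering (DEC), in the spirit of
# Xie et al. (2016) [15]; this is NOT the modified DEC of the cited paper.
# Layer sizes, epochs, and the synthetic inputs below are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.cluster import KMeans

class AutoEncoder(nn.Module):
    def __init__(self, in_dim=50, latent_dim=10):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(),
                                     nn.Linear(64, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(),
                                     nn.Linear(64, in_dim))

    def forward(self, x):
        z = self.encoder(x)
        return z, self.decoder(z)

def soft_assign(z, centers, alpha=1.0):
    """Student's t kernel between embedded points and cluster centres (soft assignment Q)."""
    d2 = torch.cdist(z, centers) ** 2
    q = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(dim=1, keepdim=True)

def target_distribution(q):
    """Sharpened target P: emphasise high-confidence assignments."""
    w = q ** 2 / q.sum(dim=0)
    return w / w.sum(dim=1, keepdim=True)

# Placeholder inputs standing in for Doc2Vec vectors of abstracts/titles.
X = torch.randn(200, 50)
n_clusters = 5

# 1) Pretrain the autoencoder on reconstruction loss.
ae = AutoEncoder()
opt = torch.optim.Adam(ae.parameters(), lr=1e-3)
for _ in range(100):
    z, x_hat = ae(X)
    loss = F.mse_loss(x_hat, X)
    opt.zero_grad(); loss.backward(); opt.step()

# 2) Initialise cluster centres with K-means on the latent space.
with torch.no_grad():
    z0, _ = ae(X)
km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(z0.numpy())
centers = nn.Parameter(torch.tensor(km.cluster_centers_, dtype=torch.float32))

# 3) Jointly refine encoder and centres by minimising KL(P || Q).
#    (For brevity P is recomputed every step; DEC updates it periodically.)
opt = torch.optim.Adam(list(ae.encoder.parameters()) + [centers], lr=1e-3)
for _ in range(100):
    z, _ = ae(X)
    q = soft_assign(z, centers)
    p = target_distribution(q).detach()
    kl = F.kl_div(q.log(), p, reduction="batchmean")
    opt.zero_grad(); kl.backward(); opt.step()

with torch.no_grad():
    labels = soft_assign(ae.encoder(X), centers).argmax(dim=1)
print(labels[:20])
```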