A Data-driven Latent Semantic Analysis for Automatic Text Summarization using LDA Topic Modelling

Onah, Daniel F. O.; Pang, Elaine L. L.; El-Haj, Mahmoud

doi:10.1109/bigdata55660.2022.10020259

Cited by 12 publications

(4 citation statements)

References 31 publications

“…Several studies have also demonstrated the viability of a topic modeling approach on extractive summarization [14]-[16]. Those studies incorporated bag-of-words approach-based algorithms, such as latent dirichlet allocation (LDA) and latent semantic analysis (LSA).…”

Section: Related Work 21 Topic Modeling (mentioning)

confidence: 99%

Hybrid model for extractive single document summarization: utilizing BERTopic and BERT model

Maryanto, Philips, Suganda Girsang

2024

IJ-AI

Extractive text summarization has been a popular research area for many years. The goal of this task is to generate a compact and coherent summary of a given document, preserving the most important information. However, current extractive summarization methods still face several challenges such as semantic drift, repetition, redundancy, and lack of coherence. A novel approach is presented in this paper to improve the performance of an extractive summarization model based on bidirectional encoder representations from transformers (BERT) by incorporating topic modeling using the BERTopic model. Our method first utilizes BERTopic to identify the dominant topics in a document and then employs a BERT-based deep neural network to extract the most salient sentences related to those topics. Our experiments on the cable news network (CNN)/daily mail dataset demonstrate that our proposed method outperforms state-of-the-art BERT-based extractive summarization models in terms of recall-oriented understudy for gisting evaluation (ROUGE) scores, which resulted in an increase of 32.53% of ROUGE-1, 47.55% of ROUGE-2, and 16.63% of ROUGE-L when compared to baseline BERT-based extractive summarization models. This paper contributes to the field of extractive text summarization, highlights the potential of topic modeling in improving summarization results, and provides a new direction for future research.

“…The analyst should find the optimal interpretation by changing lambda to a value between 0 and 1. If lambda is close to 0, the characteristics of the subject are emphasized, but unnecessary junk words may also be extracted; if lambda is close to 1, words that reveal the characteristics of the subject may not appear [29].…”

Section: Modeling (mentioning)

confidence: 99%

Digital Health Discussion Through Articles Published Until the Year 2021: A Digital Topic Modeling Approach (Preprint)

Sung, Kim

2022

Preprint

BACKGROUND Since the 2010s, the digital health industry has grown significantly, gaining popularity with the public. The term “Digital Health” is being explored in various academic fields, such as public health, medicine, and computer science. OBJECTIVE This study analyzes the research trends of digital health related articles published in the Web of Science until 2021 to understand the research concentration, boundary, scope, and characteristics. METHODS By crawling and preprocessing 27,638 digital health-related papers provided by Web of Science and investigating 15,950 of them, the number of articles published by year and by field are compared and analyzed. Since these 15,950 papers belong to the top 10 academic fields, they were regrouped into three major fields: public health, medicine, and electrical engineering and computer science (EECS). Latent Dirichlet Allocation (LDA) is applied as a topic modeling method for each field and time period. The number of topics is determined based on the coherence score. RESULTS The number of optimal topics in the first and second halves for public health were 13 and 19, for medicine, 14 and 25, and for EECS, 7 and 21, respectively. Text analysis showed that articles from public health, medicine, and EECS share similar topics but vary in composition. The homogeneity test showed that the contrast between each group is significant (p<2.2e-16). All the topics revealed in articles could be categorized into six dominant themes; journal article methodology, information technology, medical issues, subject, social phenomenon, and healthcare. As a result of the LDA analysis, the topics of each domain differed, and the composition of each theme was different between academic fields and time periods. Studies on public health focused on social phenomena, prevention, and daily care, while studies in medicine investigated treatment and cure issues in the second half. 
Studies in EECS highlighted the importance of technical issues, while showing a comparatively distant relation to public health or medicine. All fields emphasized information technology (IT) in the first half, and each domain published specialized articles in the second half. In particular, there were numerous articles belonging to both public health and medicine, while only a few were common with EECS. CONCLUSIONS The articles belonging to each domain became more specialized and distinguished from other domains and all three fields highlighted social phenomena and healthcare over time. With Covid-19 becoming a dominant issue recently, digital health has come to be strongly related to depression and mental disorders, education, and physical activity with articles on these topics appearing in the second half in all fields. The scope of digital health research is expanding and its composition fluctuating. In the future, it will be necessary to explore papers on expanded topics that reflect people's needs for digital health.
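
The lambda trade-off quoted in the citation statement from this preprint can be illustrated with a small sketch. The statement does not give a formula, so this assumes it refers to the relevance weighting commonly used when inspecting LDA topics, lambda * log p(w|t) + (1 - lambda) * log(p(w|t) / p(w)). The function names and the toy probabilities are hypothetical and do not come from the cited work.

```python
import math

def relevance(p_w_given_t, p_w, lam):
    """Assumed relevance score: lam*log p(w|t) + (1-lam)*log(p(w|t)/p(w))."""
    return lam * math.log(p_w_given_t) + (1.0 - lam) * math.log(p_w_given_t / p_w)

def rank_terms(topic_probs, corpus_probs, lam):
    """Return the topic's terms sorted from most to least relevant."""
    return sorted(
        topic_probs,
        key=lambda w: relevance(topic_probs[w], corpus_probs[w], lam),
        reverse=True,
    )

# Hypothetical toy values: p(w|t) for one topic and p(w) over the whole corpus.
topic_probs = {"health": 0.20, "data": 0.15, "telemedicine": 0.05}
corpus_probs = {"health": 0.10, "data": 0.30, "telemedicine": 0.01}

# lambda = 1 ranks purely by in-topic probability (frequent, generic words first).
print(rank_terms(topic_probs, corpus_probs, 1.0))
# lambda = 0 ranks purely by lift p(w|t)/p(w) (topic-specific, possibly rare words first).
print(rank_terms(topic_probs, corpus_probs, 0.0))
```

With these toy numbers, "data" ranks second at lambda = 1 because it is frequent in the topic, but last at lambda = 0 because it is even more frequent in the corpus as a whole.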

“…LDA also facilitates abstraction through topic summarisation. LDA generates a set of word-topic distributions representing the probability of each word occurring in each topic (Onah et al, 2022). Through examining the most probable words that are associated within each topic, researchers can gain an understanding of the main concepts and themes that are represented by the topics.…”

Section: Introduction (mentioning)

confidence: 99%

“…Through examining the most probable words that are associated within each topic, researchers can gain an understanding of the main concepts and themes that are represented by the topics. This summarisation aids in distilling key information and abstracting the data (Onah et al, 2022). This then helps to accomplish a key outcome sought throughout an SLR; the identification of gaps in the research domain under investigation through a comprehensive summary of its pertinent research (Paul et al, 2021).…”

Section: Introduction (mentioning)

confidence: 99%

Cheap, Quick, and Rigorous: Artificial Intelligence and the Systematic Literature Review

Atkinson

2023

Social Science Computer Review

The systematic literature review (SLR) is the gold standard in providing research a firm evidence foundation to support decision-making. Researchers seeking to increase the rigour, transparency, and replicability of their SLRs are provided a range of guidelines towards these ends. Artificial Intelligence (AI) and Machine Learning Techniques (MLTs) developed with computer programming languages can provide methods to increase the speed, rigour, transparency, and repeatability of SLRs. Aimed towards researchers with coding experience, and who want to utilise AI and MLTs to synthesise and abstract data obtained through a SLR, this article sets out how computer languages can be used to facilitate unsupervised machine learning for synthesising and abstracting data sets extracted during a SLR. Utilising an already known qualitative method, Deductive Qualitative Analysis, this article illustrates the supportive role that AI and MLTs can play in the coding and categorisation of extracted SLR data, and in synthesising SLR data. Using a data set extracted during a SLR as a proof of concept, this article will include the coding used to create a well-established MLT, Topic Modelling using Latent Dirichlet allocation. This technique provides a working example of how researchers can use AI and MLTs to automate the data synthesis and abstraction stage of their SLR, and aide in increasing the speed, frugality, and rigour of research projects.
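
The citation statements from this article describe reading a topic through its most probable words. A minimal sketch of that step follows, assuming the word-topic distributions are already available as plain dictionaries. The `top_words` helper and the example probabilities are hypothetical illustrations, not code or data from the cited article.

```python
def top_words(topic_word, n=3):
    """For each topic, return its n most probable words, highest first."""
    return {
        topic: [w for w, _ in sorted(dist.items(), key=lambda kv: kv[1], reverse=True)[:n]]
        for topic, dist in topic_word.items()
    }

# Hypothetical word-topic distributions (each maps word -> probability in that topic).
topic_word = {
    "topic_0": {"summary": 0.30, "sentence": 0.25, "document": 0.20, "patient": 0.05},
    "topic_1": {"patient": 0.35, "care": 0.30, "health": 0.20, "summary": 0.02},
}

for topic, words in top_words(topic_word, n=2).items():
    print(topic, words)
```

In practice the distributions would come from a fitted LDA model; the ranking step itself is independent of the library used to fit it.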
