2021
DOI: 10.1371/journal.pone.0243208

Two-stage topic modelling of scientific publications: A case study of University of Nairobi, Kenya

Abstract: Unsupervised statistical analysis of unstructured data has gained wide acceptance especially in natural language processing and text mining domains. Topic modelling with Latent Dirichlet Allocation is one such statistical tool that has been successfully applied to synthesize collections of legal, biomedical documents and journalistic topics. We applied a novel two-stage topic modelling approach and illustrated the methodology with data from a collection of published abstracts from the University of Nairobi, Ke…

Citations: cited by 21 publications (13 citation statements)
References: 34 publications
“…Due to the unsupervised nature of the method which potentially could remove the human bias from the review process, and capacity to process a large number of documents at a relatively low computational cost, the use of NLP and Topic Modelling is becoming more popular in academia for explorative literature studies (Valle et al, 2014;Liu et al, 2016;Asmussen and Møller, 2019;Muchene and Safari, 2021). The interpretation of the results produced by LDA models, however, might pose a challenge if the initial hypothesis is not supported by manual overview of the text material or if the number of topics produced by the model is not cross-fold validated against the initial dataset.…”
Section: Discussion (mentioning)
confidence: 99%
“…The extracted semantic structures are called topics and represent recurring patterns or clusters of co-occurring words in documents (27). Topics are extracted based on a probabilistic model that determines the most frequent co-occurring words over all documents (28). Key elements of TM are words or terms (a basic unit of discrete data), documents (a sequence of terms), corpus (a collection of documents), and document-term-matrix (DTM; a matrix that presents the frequency of each word in each document) (28).…”
Section: Topic Modeling (mentioning)
confidence: 99%
“…Topics are extracted based on a probabilistic model that determines the most frequent co-occurring words over all documents (28). Key elements of TM are words or terms (a basic unit of discrete data), documents (a sequence of terms), corpus (a collection of documents), and document-term-matrix (DTM; a matrix that presents the frequency of each word in each document) (28). An example of a DTM is presented in Table 1 where each cell is a frequency of terms used (column) in each document (row).…”
Section: Topic Modeling (mentioning)
confidence: 99%
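
The document-term matrix described in the statements above (documents as rows, terms as columns, each cell a term frequency) can be illustrated with a minimal sketch. This is not the cited authors' code: the function name `build_dtm`, the toy corpus, and the lowercase whitespace tokenization are assumptions made purely for illustration.

```python
from collections import Counter


def build_dtm(documents):
    """Build a document-term matrix from a list of strings.

    Returns (vocabulary, matrix), where matrix[i][j] is the number of
    times vocabulary[j] occurs in documents[i].
    Assumption: terms are lowercase, whitespace-separated tokens.
    """
    tokenized = [doc.lower().split() for doc in documents]
    # Columns: every distinct term in the corpus, in sorted order.
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)  # missing terms count as 0
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix


# Hypothetical three-document corpus (not data from the paper).
corpus = [
    "topic models find topics",
    "documents contain words",
    "topics group words in documents",
]

vocabulary, dtm = build_dtm(corpus)
print(vocabulary)
for row in dtm:
    print(row)
```

Each printed row corresponds to one document and each position in the row to one term in `vocabulary`. A matrix of this shape is the usual input to a topic model such as Latent Dirichlet Allocation; real pipelines typically also remove stop words and normalize word forms, which this sketch omits.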