2020
DOI: 10.5117/ccr2020.2.001.maie

How Document Sampling and Vocabulary Pruning Affect the Results of Topic Models

Abstract: Topic modeling enables researchers to explore large document corpora. Large corpora, however, can be extremely costly to model in terms of time and computing resources. In order to circumvent this problem, two techniques have been suggested: (1) to model random document samples, and (2) to prune the vocabulary of the corpus. Although these techniques are frequently applied, there has been no systematic inquiry into how their application affects the respective models. Using three empirical corpora with different …
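The two techniques the abstract refers to can be illustrated with a minimal Python sketch. This is not the authors' implementation: the toy corpus, the 10% sample fraction, the pruning thresholds (min_df, max_df), and the use of scikit-learn's LDA are all assumptions made for illustration.

```python
import random

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus standing in for a large document collection (assumption:
# in practice `documents` would hold hundreds of thousands of texts).
documents = [
    "topics emerge from word co-occurrence patterns in documents",
    "sampling documents reduces the cost of fitting a topic model",
    "pruning rare and overly frequent terms shrinks the vocabulary",
    "topic models summarise large corpora of text documents",
] * 250  # inflate the toy corpus so sampling is meaningful

# (1) Model a random document sample instead of the full corpus.
SAMPLE_FRACTION = 0.10  # assumed fraction; the paper varies this systematically
random.seed(42)
sample = random.sample(documents, k=int(len(documents) * SAMPLE_FRACTION))

# (2) Prune the vocabulary: drop terms occurring in fewer than `min_df`
# documents or in more than `max_df` (proportion) of documents.
vectorizer = CountVectorizer(min_df=5, max_df=0.5, stop_words="english")
dtm = vectorizer.fit_transform(sample)

# Fit a topic model on the reduced document-term matrix.
lda = LatentDirichletAllocation(n_components=10, random_state=42)
lda.fit(dtm)

print(f"{dtm.shape[0]} documents, {dtm.shape[1]} terms after pruning")
```

Both interventions shrink the document-term matrix before model fitting, which is where the time and memory savings discussed in the paper come from.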

Cited by 22 publications (13 citation statements)
References 10 publications
“…However, it raises the question whether the number of sildenafil reviews was sufficient for topic modeling. It is reported that the sample size requirement for topic modeling varies with document characteristics, such as content heterogeneity and document length [ 50 , 51 ]. Patient medication reviews have a longer document length than typical tweets.…”
Section: Discussion
confidence: 99%
“…Valid texts were then preprocessed following current recommendations 35 37 , including tokenization, cleaning, stop word removal 38 , vocabulary pruning 39 , and lemmatization 40 . Texts were represented using a bag-of-words, unigram approach 41 , which decomposes texts into singular words without retaining information about word order.…”
Section: Methods
confidence: 99%
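The preprocessing steps listed in the quoted Methods passage (tokenization, cleaning, stop-word removal, vocabulary pruning, lemmatization, bag-of-words unigrams) can be sketched as follows. The libraries (gensim, NLTK), the thresholds, and the example texts are assumptions for illustration, not the cited study's exact pipeline.

```python
import nltk
from gensim.corpora import Dictionary
from gensim.parsing.preprocessing import STOPWORDS
from gensim.utils import simple_preprocess
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)   # lemmatizer lexicon
nltk.download("omw-1.4", quiet=True)

# Hypothetical example texts; real input would be the study's documents.
raw_texts = [
    "The reviewers described mild side effects after taking the medication.",
    "Patients reported improved symptoms and fewer side effects over time.",
    "Some reviews mentioned headaches, while others described no effects.",
]

lemmatizer = WordNetLemmatizer()

def preprocess(text):
    # Tokenization and cleaning: lowercase, strip punctuation and short tokens.
    tokens = simple_preprocess(text, deacc=True)
    # Stop-word removal, then lemmatization of the remaining unigrams.
    return [lemmatizer.lemmatize(tok) for tok in tokens if tok not in STOPWORDS]

tokenized = [preprocess(t) for t in raw_texts]

# Vocabulary pruning: drop terms in fewer than `no_below` documents or in
# more than `no_above` (proportion) of documents.
dictionary = Dictionary(tokenized)
dictionary.filter_extremes(no_below=2, no_above=0.9)

# Bag-of-words unigram representation: word order is discarded.
bow_corpus = [dictionary.doc2bow(doc) for doc in tokenized]
print(bow_corpus[0])
```

The bag-of-words output retains only per-document term counts, which is what the quoted passage means by decomposing texts into singular words without retaining word order.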
“…After applying common natural language processing (NLP) steps such as lowercasing and stopword removal using the R packages tosca (Koppers et al., 2020) and tm (Feinerer et al., 2008), as well as duplicate removal, 3 767 047 non-empty documents remain in the relevant dataset. Maier et al. (2020) showed that, for datasets of 230 000 documents or more, using at least 10% of the articles already results in topics sufficiently similar to those of the complete dataset. Thus, for a faster calculation, we use a partial dataset for the study.…”
Section: Data
confidence: 99%
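The subsampling strategy described in the quoted Data passage, remove duplicates and then model a fixed fraction of the remaining documents, could look roughly like this. The pandas-based workflow, the column names, the toy documents, and the hard-coded 10% fraction are assumptions for illustration; the cited study itself works in R with tosca and tm.

```python
import math

import pandas as pd

# Toy corpus standing in for a large article collection; the column names
# and texts are hypothetical.
corpus = pd.DataFrame({
    "doc_id": range(1, 7),
    "text": [
        "inflation expectations rose in the euro area",
        "central bank signals a pause in rate hikes",
        "inflation expectations rose in the euro area",    # duplicate
        "labour market remains tight despite the slowdown",
        "energy prices drive headline inflation higher",
        "central bank signals a pause in rate hikes",       # duplicate
    ],
})

# Duplicate removal on the document text.
deduplicated = corpus.drop_duplicates(subset="text")

# Reproducible random subsample; 10% follows the quoted rule of thumb
# (rounded up here so the toy example stays non-empty).
n_sample = max(1, math.ceil(0.10 * len(deduplicated)))
sample = deduplicated.sample(n=n_sample, random_state=1)

print(f"{len(deduplicated)} unique documents, {len(sample)} sampled for modeling")
```

Fixing the random seed keeps the subsample reproducible, so the reduced dataset can be re-created exactly when the topic model is re-fitted.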