2020
DOI: 10.5117/ccr2020.2.001.maie

How Document Sampling and Vocabulary Pruning Affect the Results of Topic Models

Abstract: Topic modeling enables researchers to explore large document corpora. Large corpora, however, can be extremely costly to model in terms of time and computing resources. In order to circumvent this problem, two techniques have been suggested: (1) to model random document samples, and (2) to prune the vocabulary of the corpus. Although these techniques are frequently applied, there has been no systematic inquiry into how their application affects the respective models. Using three empirical corpora with different …
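The two techniques the abstract refers to can be illustrated with a minimal Python sketch. This is not the authors' implementation: the toy corpus, the 10% sample fraction, the pruning thresholds (min_df, max_df), and the use of scikit-learn's LDA are all assumptions made for illustration.

```python
import random

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus standing in for a large document collection (assumption:
# in practice `documents` would hold hundreds of thousands of texts).
documents = [
    "topics emerge from word co-occurrence patterns in documents",
    "sampling documents reduces the cost of fitting a topic model",
    "pruning rare and overly frequent terms shrinks the vocabulary",
    "topic models summarise large corpora of text documents",
] * 250  # inflate the toy corpus so sampling is meaningful

# (1) Model a random document sample instead of the full corpus.
SAMPLE_FRACTION = 0.10  # assumed fraction; the paper varies this systematically
random.seed(42)
sample = random.sample(documents, k=int(len(documents) * SAMPLE_FRACTION))

# (2) Prune the vocabulary: drop terms occurring in fewer than `min_df`
# documents or in more than `max_df` (proportion) of documents.
vectorizer = CountVectorizer(min_df=5, max_df=0.5, stop_words="english")
dtm = vectorizer.fit_transform(sample)

# Fit a topic model on the reduced document-term matrix.
lda = LatentDirichletAllocation(n_components=10, random_state=42)
lda.fit(dtm)

print(f"{dtm.shape[0]} documents, {dtm.shape[1]} terms after pruning")
```

Both interventions shrink the document-term matrix before model fitting, which is where the time and memory savings discussed in the paper come from.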

Cited by 22 publications (13 citation statements)
References 10 publications
“…However, it raises the question whether the number of sildenafil reviews was sufficient for topic modeling. It is reported that the sample size requirement for topic modeling varies with document characteristics, such as content heterogeneity and document length [ 50 , 51 ]. Patient medication reviews have a longer document length than typical tweets.…”
Section: Discussion
confidence: 99%
“…Valid texts were then preprocessed following current recommendations 35 37 , including tokenization, cleaning, stop word removal 38 , vocabulary pruning 39 , and lemmatization 40 . Texts were represented using a bag-of-words, unigram approach 41 , which decomposes texts into singular words without retaining information about word order.…”
Section: Methods
confidence: 99%
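The preprocessing steps listed in the quoted Methods passage (tokenization, cleaning, stop-word removal, vocabulary pruning, lemmatization, bag-of-words unigrams) can be sketched as follows. The libraries (gensim, NLTK), the thresholds, and the example texts are assumptions for illustration, not the cited study's exact pipeline.

```python
import nltk
from gensim.corpora import Dictionary
from gensim.parsing.preprocessing import STOPWORDS
from gensim.utils import simple_preprocess
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)   # lemmatizer lexicon
nltk.download("omw-1.4", quiet=True)

# Hypothetical example texts; real input would be the study's documents.
raw_texts = [
    "The reviewers described mild side effects after taking the medication.",
    "Patients reported improved symptoms and fewer side effects over time.",
    "Some reviews mentioned headaches, while others described no effects.",
]

lemmatizer = WordNetLemmatizer()

def preprocess(text):
    # Tokenization and cleaning: lowercase, strip punctuation and short tokens.
    tokens = simple_preprocess(text, deacc=True)
    # Stop-word removal, then lemmatization of the remaining unigrams.
    return [lemmatizer.lemmatize(tok) for tok in tokens if tok not in STOPWORDS]

tokenized = [preprocess(t) for t in raw_texts]

# Vocabulary pruning: drop terms in fewer than `no_below` documents or in
# more than `no_above` (proportion) of documents.
dictionary = Dictionary(tokenized)
dictionary.filter_extremes(no_below=2, no_above=0.9)

# Bag-of-words unigram representation: word order is discarded.
bow_corpus = [dictionary.doc2bow(doc) for doc in tokenized]
print(bow_corpus[0])
```

The bag-of-words output retains only per-document term counts, which is what the quoted passage means by decomposing texts into singular words without retaining word order.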
“…After applying common natural language processing (NLP) steps such as lowercasing and stopword removal using the R packages tosca (Koppers et al., 2020) and tm (Feinerer et al., 2008), as well as duplicate removal, 3 767 047 non-empty documents remain in the relevant dataset. Maier et al. (2020) showed that, for datasets of 230 000 documents or more, using at least 10% of the articles already results in topics sufficiently similar to those of the complete dataset. Thus, for a faster calculation, we use a partial dataset for the study.…”
Section: Data
confidence: 99%
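The subsampling strategy described in the quoted Data passage, remove duplicates and then model a fixed fraction of the remaining documents, could look roughly like this. The pandas-based workflow, the column names, the toy documents, and the hard-coded 10% fraction are assumptions for illustration; the cited study itself works in R with tosca and tm.

```python
import math

import pandas as pd

# Toy corpus standing in for a large article collection; the column names
# and texts are hypothetical.
corpus = pd.DataFrame({
    "doc_id": range(1, 7),
    "text": [
        "inflation expectations rose in the euro area",
        "central bank signals a pause in rate hikes",
        "inflation expectations rose in the euro area",    # duplicate
        "labour market remains tight despite the slowdown",
        "energy prices drive headline inflation higher",
        "central bank signals a pause in rate hikes",       # duplicate
    ],
})

# Duplicate removal on the document text.
deduplicated = corpus.drop_duplicates(subset="text")

# Reproducible random subsample; 10% follows the quoted rule of thumb
# (rounded up here so the toy example stays non-empty).
n_sample = max(1, math.ceil(0.10 * len(deduplicated)))
sample = deduplicated.sample(n=n_sample, random_state=1)

print(f"{len(deduplicated)} unique documents, {len(sample)} sampled for modeling")
```

Fixing the random seed keeps the subsample reproducible, so the reduced dataset can be re-created exactly when the topic model is re-fitted.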