Andreas Niekler scite author profile

Latent Dirichlet allocation (LDA) topic models are increasingly being used in communication research. Yet, questions regarding reliability and validity of the approach have received little attention thus far. In applying LDA to textual data, researchers need to tackle at least four major challenges that affect these criteria: (a) appropriate pre-processing of the text collection; (b) adequate selection of model parameters, including the number of topics to be generated; (c) evaluation of the model's reliability; and (d) the process of validly interpreting the resulting topics. We review the research literature dealing with these questions and propose a methodology that approaches these challenges. Our overall goal is to make LDA topic modeling more accessible to communication researchers and to ensure compliance with disciplinary standards. Consequently, we develop a brief hands-on user guide for applying LDA topic modeling. We demonstrate the value of our approach with empirical data from an ongoing research project.

show abstract

Revisiting Uncertainty-based Query Strategies for Active Learning with Transformers

Schröder¹,

Niekler²,

Potthast³

2022

View full text Add to dashboard Cite

Active learning is the iterative construction of a classification model through targeted labeling, enabling significant labeling cost savings. As most research on active learning has been carried out before transformer-based language models ("transformers") became popular, despite its practical importance, comparably few papers have investigated how transformers can be combined with active learning to date. This can be attributed to the fact that using state-of-the-art query strategies for transformers induces a prohibitive runtime overhead, which effectively nullifies, or even outweighs the desired cost savings. For this reason, we revisit uncertainty-based query strategies, which had been largely outperformed before, but are particularly suited in the context of fine-tuning transformers. In an extensive evaluation, we connect transformers to experiments from previous research, assessing their performance on five widely used text classification benchmarks. For active learning with transformers, several other uncertainty-based approaches outperform the well-known prediction entropy query strategy, thereby challenging its status as most popular uncertainty baseline in active learning for text classification.

show abstract

How Document Sampling and Vocabulary Pruning Affect the Results of Topic Models

Maier¹,

Niekler²,

Wiedemann³

et al. 2020

View full text Add to dashboard Cite

Topic modeling enables researchers to explore large document corpora. Large corpora, however, can be extremely costly to model in terms of time and computing resources. In order to circumvent this problem, two techniques have been suggested: (1) to model random document samples, and (2) to prune the vocabulary of the corpus. Although frequently applied, there has been no systematic inquiry into how the application of these techniques affects the respective models. Using three empirical corpora with different characteristics (news articles, websites, and Tweets), we systematically investigated how different sample sizes and pruning affect the resulting topic models in comparison to models of the full corpora. Our inquiry provides evidence that both techniques are viable tools that will likely not impair the resulting model. Sample-based topic models closely resemble corpus-based models if the sample size is large enough (> 10,000 documents). Moreover, extensive pruning does not compromise the quality of the resultant topics.

show abstract

Matching Results of Latent Dirichlet Allocation for Text

Niekler¹,

Jähnichen²

2012

View full text Add to dashboard Cite

Many approaches have been introduced to enable Latent Dirichlet Allocation (LDA) models to be updated in an online manner. This includes inferring new documents into the model, passing parameter priors to the inference algorithm or a mixture of both, leading to more complicated and computationally expensive models. We present a method to match and compare the resulting LDA topics of different models with light weight easy to use similarity measures. We address the on-line problem by keeping the model inference simple and matching topics solely by their high probability word lists.

show abstract

Detecting and Assessing Contextual Change in Diachronic Text Documents using Context Volatility

Kahmann

Niekler

Heyer

2017

View full text Add to dashboard Cite

Abstract:Terms in diachronic text corpora may exhibit a high degree of semantic dynamics that is only partially captured by the common notion of semantic change. The new measure of context volatility that we propose models the degree by which terms change context in a text collection over time. The computation of context volatility for a word relies on the significance-values of its co-occurrent terms and the corresponding co-occurrence ranks in sequential time spans. We define a baseline and present an efficient computational approach in order to overcome problems related to computational issues in the data structure. Results are evaluated both, on synthetic documents that are used to simulate contextual changes, and a real example based on British newspaper texts. The data and software are avaiable at https://git.informatik.uni-leipzig.de/mam10cip/KDIR.git

show abstract

scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.

Contact Info

hi@scite.ai

10624 S. Eastern Ave., Ste. A-614

Henderson, NV 89052, USA

Blog Terms and Conditions API Terms Privacy Policy Contact Cookie Preferences Do Not Sell or Share My Personal Information

Made with 💙 for researchers

Part of the Research Solutions Family.

Andreas Niekler

Applying LDA Topic Modeling in Communication Research: Toward a Valid and Reliable Methodology

Revisiting Uncertainty-based Query Strategies for Active Learning with Transformers

How Document Sampling and Vocabulary Pruning Affect the Results of Topic Models

Matching Results of Latent Dirichlet Allocation for Text

Detecting and Assessing Contextual Change in Diachronic Text Documents using Context Volatility

Contact Info

Product

Resources

About