Efficient Methods for Incorporating Knowledge into Topic Models

Yang, Yi; Downey, Doug; Boyd-Graber, Jordan

doi:10.18653/v1/d15-1037

Cited by 43 publications

(31 citation statements)

References 24 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…More broadly, there are many efforts to improve the semantic interpretability of topic models . In particular, much work has improved topic quality via different priors: Wallach et al show the effectiveness of general asymmetric priors to improve topic quality, Newman et al use an informative prior capturing short range dependencies between words, and Andrzejewski et al use Dirichlet Forest priors to capture corpus structure.…”

Section: Related Workmentioning

confidence: 99%

Assessing topic model relevance: Evaluation and informative priors

Fan

Doshi‐Velez

Miratrix

2019

Statistical Analysis

View full text Add to dashboard Cite

Latent Dirichlet allocation (LDA) models trained without stopword removal often produce topics with high posterior probabilities on uninformative words, obscuring the underlying corpus content. Even when canonical stopwords are manually removed, uninformative words common in that corpus will still dominate the most probable words in a topic. In this work, we first show how the standard topic quality measures of coherence and pointwise mutual information act counter-intuitively in the presence of common but irrelevant words, making it difficult to even quantitatively identify situations in which topics may be dominated by stopwords. We propose an additional topic quality metric that targets the stopword problem, and show that it, unlike the standard measures, correctly correlates with human judgments of quality as defined by concentration of information-rich words. We also propose a simple-to-implement strategy for generating topics that are evaluated to be of much higher quality by both human assessment and our new metric. This approach, a collection of informative priors easily introduced into most LDA-style inference methods, automatically promotes terms with domain relevance and demotes domain-specific stop words. We demonstrate this approach's effectiveness in three very different domains: Department of Labor accident reports, online health forum posts, and NIPS abstracts. Overall we find that current practices thought to solve this problem do not do so adequately, and that our proposal offers a substantial improvement for those interested in interpreting their topics as objects in their own right. K E Y W O R D Sinformative priors, latent dirichlet allocation, topic modeling INTRODUCTIONLatent Dirichlet allocation (LDA) [4] is a popular model for modeling topics in large textual corpora as probability vectors over terms in the vocabulary. LDA posits that each document d is a mixture d over K topics, each topic k is a mixture k over a common, set vocabulary of size V, and w d, n , the nth word in document d, is generated by first sampling a topic z d, n from d and then drawing a word from that topic:

show abstract

Section: Related Workmentioning

confidence: 99%

Assessing topic model relevance: Evaluation and informative priors

Fan

Doshi‐Velez

Miratrix

2019

Statistical Analysis

View full text Add to dashboard Cite

show abstract

“…publication year and publication type) can be included as covariates which inform either document-topic proportions (topic prevalence) or topic-term probabilities (topic content) (Roberts et al, 2014: 5). Topic models with document metadata covariates have been shown to produce more coherent and domain-specific topics (Yang et al, 2015), and to perform better in terms of statistical quantities of interest, e.g. calculated covariate relationships with uncertainty estimates (Roberts et al, 2014).…”

Section: Model Specification and Estimationmentioning

confidence: 99%

Cannabis in Danish newspapers

Houborg

Enghoff²

2018

TFSS

View full text Add to dashboard Cite

Using quantitative methods Danish cannabis debate in national newspapers is investigated. The investigation shows that the most prevalent topics relate to law enforcement. Legalization has become an increasingly important topic in the Danish cannabis debate and the investigation shows a reframing of this debate to become increasingly related to concerns about organized crime. In this way the Danish cannabis legalization debate show the same development as the debates that have led to legalization certain states in the United States of America.

show abstract

“…As the acceptance of topic coherence measures increases as a mean of topic model assessment (Paul and Girju, 2010;Reisinger et al, 2010;Hall et al, 2012), recent research trends focus on proposing fast and efficient models that can be scaled up to big amounts of data (Yang et al, 2015;Nguyen et al, 2015), using the whole text per document for training.…”

Section: Related Workmentioning

confidence: 99%

Improving Topic Coherence Using Entity Extraction Denoising

Cardenas¹,

Bello²,

Coronado³

et al. 2018

The Prague Bulletin of Mathematical Linguistics

View full text Add to dashboard Cite

Managing large collections of documents is an important problem for many areas of science, industry, and culture. Probabilistic topic modeling offers a promising solution. Topic modeling is an unsupervised machine learning method and the evaluation of this model is an interesting problem on its own. Topic interpretability measures have been developed in recent years as a more natural option for topic quality evaluation, emulating human perception of coherence with word sets correlation scores. In this paper, we show experimental evidence of the improvement of topic coherence score by restricting the training corpus to that of relevant information in the document obtained by Entity Recognition. We experiment with job advertisement data and find that with this approach topic models improve interpretability in about 40 percentage points on average. Our analysis reveals as well that using the extracted text chunks, some redundant topics are joined while others are split into more skill-specific topics. Fine-grained topics observed in models using the whole text are preserved.

show abstract

Efficient Methods for Incorporating Knowledge into Topic Models

Cited by 43 publications

References 24 publications

Assessing topic model relevance: Evaluation and informative priors

Assessing topic model relevance: Evaluation and informative priors

Cannabis in Danish newspapers

Improving Topic Coherence Using Entity Extraction Denoising

Contact Info

Product

Resources

About