2010
DOI: 10.14778/1920841.1920931

An architecture for parallel topic models

Abstract: This paper describes a high performance sampling architecture for inference of latent topic models on a cluster of workstations. Our system is faster than previous work by over an order of magnitude and it is capable of dealing with hundreds of millions of documents and thousands of topics. The algorithm relies on a novel communication structure, namely the use of a distributed (key, value) storage for synchronizing the sampler state between computers. Our architecture entirely obviates the need for separate co…
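
To make the synchronization idea concrete, here is a minimal Python sketch of samplers reconciling local topic counts through a shared (key, value) store. `KVStore`, `reconcile`, and the count layout are illustrative assumptions for this sketch, not the paper's actual implementation.

```python
# Sketch of (key, value)-based synchronization: each worker samples
# against local counts and periodically reconciles them with a shared
# store, rather than alternating separate compute and sync phases.

from collections import defaultdict

class KVStore:
    """Toy in-memory stand-in for a distributed (key, value) store."""
    def __init__(self):
        self.data = defaultdict(int)

    def add(self, key, delta):
        self.data[key] += delta      # assumed atomic on a real store
        return self.data[key]

def reconcile(store, local, shadow, word, num_topics):
    """Push this worker's count deltas for `word`, pull the merged view.

    `local` holds the worker's current topic counts; `shadow` holds the
    counts as of the last reconciliation, so their difference is exactly
    what this worker still needs to push.
    """
    for k in range(num_topics):
        key = (word, k)
        merged = store.add(key, local[key] - shadow[key])
        local[key] = shadow[key] = merged
```

The point of the shadow copy is that a worker only pushes what it has changed since its last reconciliation, so many samplers can merge their updates into the same store without a global barrier.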

Cited by 341 publications (273 citation statements)
References 8 publications (12 reference statements)
“…In addition to providing a solution to the problem of growing document collections, online algorithms open up avenues for parallelizing inference that differ from those of batch algorithms, offering ways to draw on the computing power of multiprocessor systems (another common approach to scaling LDA; see e.g. [18,15,22] and references therein) and different runtime and performance tradeoffs than other algorithms. Here, we explore another avenue opened up by online LDA algorithms, namely to revisit batch LDA and ask whether we can improve it by viewing it as a quasi-online approach that processes documents, or mini-batches of documents, one at a time.…”
Section: Introduction
Citation type: mentioning
Confidence: 99%
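
To make the quasi-online reading of batch LDA in the statement above concrete, a schematic sketch follows; `e_step` and `m_step` are hypothetical placeholders for per-batch inference and the global topic update, not an actual library API.

```python
def quasi_online_lda(corpus, e_step, m_step, batch_size=256, passes=1):
    """Run batch-style LDA as a sequence of mini-batch updates."""
    for _ in range(passes):              # repeated passes recover batch LDA
        for start in range(0, len(corpus), batch_size):
            minibatch = corpus[start:start + batch_size]
            stats = e_step(minibatch)    # per-document inference on the batch
            m_step(stats)                # fold the statistics into the topics
```
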
“…Algorithmic improvements [15] and pipeline processing or other scheduling techniques [16], [17] are left for future work.…”
Section: Results
Citation type: mentioning
Confidence: 99%
“…• Smola et al. [16] developed a parallel inference method for LDA on a cluster system. Their system introduces a blackboard-style architecture to facilitate simultaneous communication and sampling across different computers in a cluster environment.…”
Section: Fast Inference Methods for LDA
Citation type: mentioning
Confidence: 99%
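
One way to read the "simultaneous communication and sampling" point is that synchronization runs concurrently with sampling rather than in alternating phases. The toy sketch below illustrates that overlap with a background thread; `sample_pass` and `sync_with_store` are stand-ins for this sketch, not the system's actual routines.

```python
import threading
import time

def sample_pass(state):
    """Stand-in for one sweep of the local Gibbs sampler."""
    state["sweeps"] += 1

def sync_with_store(state, store):
    """Stand-in for reconciling local counts with the shared store."""
    store["sweeps"] = state["sweeps"]

def run_worker(state, store, stop, sync_interval=0.1):
    """Sample continuously while a background thread synchronizes."""
    def communicate():
        while not stop.is_set():
            sync_with_store(state, store)   # communication happens here...
            time.sleep(sync_interval)

    t = threading.Thread(target=communicate)
    t.start()
    while not stop.is_set():
        sample_pass(state)                  # ...while sampling continues
    t.join()
```

A caller would create a `threading.Event`, schedule `stop.set()` (for example with `threading.Timer`), and invoke `run_worker({"sweeps": 0}, {}, stop)`.
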
“…A distributed system developed at Yahoo! is reported to process 42,000 documents per hour [103]. Alternatively, an online inference algorithm for LDA [47] promises both improved scalability and a principled means of updating topics to reflect new documents.…”
Section: Discussion
Citation type: mentioning
Confidence: 99%