2010
DOI: 10.14778/1920841.1920931

An architecture for parallel topic models

Abstract: This paper describes a high performance sampling architecture for inference of latent topic models on a cluster of workstations. Our system is faster than previous work by over an order of magnitude and it is capable of dealing with hundreds of millions of documents and thousands of topics. The algorithm relies on a novel communication structure, namely the use of a distributed (key, value) storage for synchronizing the sampler state between computers. Our architecture entirely obviates the need for separate co…
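
To make the synchronization idea concrete, here is a minimal Python sketch of samplers reconciling local topic counts through a shared (key, value) store. `KVStore`, `reconcile`, and the count layout are illustrative assumptions for this sketch, not the paper's actual implementation.

```python
# Sketch of (key, value)-based synchronization: each worker samples
# against local counts and periodically reconciles them with a shared
# store, rather than alternating separate compute and sync phases.

from collections import defaultdict

class KVStore:
    """Toy in-memory stand-in for a distributed (key, value) store."""
    def __init__(self):
        self.data = defaultdict(int)

    def add(self, key, delta):
        self.data[key] += delta      # assumed atomic on a real store
        return self.data[key]

def reconcile(store, local, shadow, word, num_topics):
    """Push this worker's count deltas for `word`, pull the merged view.

    `local` holds the worker's current topic counts; `shadow` holds the
    counts as of the last reconciliation, so their difference is exactly
    what this worker still needs to push.
    """
    for k in range(num_topics):
        key = (word, k)
        merged = store.add(key, local[key] - shadow[key])
        local[key] = shadow[key] = merged
```

The point of the shadow copy is that a worker only pushes what it has changed since its last reconciliation, so many samplers can merge their updates into the same store without a global barrier.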

Cited by 341 publications (273 citation statements)
References 8 publications (12 reference statements)
“…In addition to providing a solution to the problem of growing document collections, online algorithms open up avenues for parallelizing inference that differ from those of batch algorithms, offering ways to draw on the computing power of multiprocessor systems (another common approach to scaling LDA; see e.g. [18,15,22] and references therein) and different runtime and performance tradeoffs than other algorithms. Here, we explore another avenue opened up by online LDA algorithms, namely to revisit batch LDA and ask whether we can improve it by viewing it as a quasi-online approach that processes documents, or mini-batches of documents, one at a time.…”
Section: Introduction
Citation type: mentioning
Confidence: 99%
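
To make the quasi-online reading of batch LDA in the statement above concrete, a schematic sketch follows; `e_step` and `m_step` are hypothetical placeholders for per-batch inference and the global topic update, not an actual library API.

```python
def quasi_online_lda(corpus, e_step, m_step, batch_size=256, passes=1):
    """Run batch-style LDA as a sequence of mini-batch updates."""
    for _ in range(passes):              # repeated passes recover batch LDA
        for start in range(0, len(corpus), batch_size):
            minibatch = corpus[start:start + batch_size]
            stats = e_step(minibatch)    # per-document inference on the batch
            m_step(stats)                # fold the statistics into the topics
```
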
“…Algorithmic improvements [15] and pipeline processing or other scheduling techniques [16], [17] are left for future work.…”
Section: Results
Citation type: mentioning
Confidence: 99%
“…• Smola et al. [16] developed a parallel inference method for LDA on a cluster system. Their system introduces a blackboard-style architecture to facilitate simultaneous communication and sampling across different computers in a cluster environment.…”
Section: Fast Inference Methods for LDA
Citation type: mentioning
Confidence: 99%
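
One way to read the "simultaneous communication and sampling" point is that synchronization runs concurrently with sampling rather than in alternating phases. The toy sketch below illustrates that overlap with a background thread; `sample_pass` and `sync_with_store` are stand-ins for this sketch, not the system's actual routines.

```python
import threading
import time

def sample_pass(state):
    """Stand-in for one sweep of the local Gibbs sampler."""
    state["sweeps"] += 1

def sync_with_store(state, store):
    """Stand-in for reconciling local counts with the shared store."""
    store["sweeps"] = state["sweeps"]

def run_worker(state, store, stop, sync_interval=0.1):
    """Sample continuously while a background thread synchronizes."""
    def communicate():
        while not stop.is_set():
            sync_with_store(state, store)   # communication happens here...
            time.sleep(sync_interval)

    t = threading.Thread(target=communicate)
    t.start()
    while not stop.is_set():
        sample_pass(state)                  # ...while sampling continues
    t.join()
```

A caller would create a `threading.Event`, schedule `stop.set()` (for example with `threading.Timer`), and invoke `run_worker({"sweeps": 0}, {}, stop)`.
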
“…A distributed system developed at Yahoo! is reported to process 42,000 documents per hour [103]. Alternatively, an online inference algorithm for LDA [47] promises both improved scalability and a principled means of updating topics to reflect new documents.…”
Section: Discussion
Citation type: mentioning
Confidence: 99%