2011
DOI: 10.1007/978-3-642-23808-6_31
Larger Residuals, Less Work: Active Document Scheduling for Latent Dirichlet Allocation

Abstract: Recently, there have been considerable advances in fast inference for latent Dirichlet allocation (LDA). In particular, stochastic optimization of the variational Bayes (VB) objective function with a natural gradient step has been proved to converge and is able to process massive document collections. To reduce noise in the gradient estimate, it considers multiple documents chosen uniformly at random. While it is widely recognized that the scheduling of documents in stochastic optimization may have signifi…
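The abstract contrasts uniform document sampling with scheduling. A minimal sketch of the idea, assuming a residual-proportional sampling rule (the function name, array values, and the exact proportional rule are illustrative assumptions, not the paper's algorithm):

```python
import numpy as np

def sample_minibatch(residuals, batch_size, rng):
    # Pick document indices with probability proportional to their
    # residuals; equal residuals reduce this to uniform sampling.
    probs = residuals / residuals.sum()
    return rng.choice(len(residuals), size=batch_size, replace=False, p=probs)

rng = np.random.default_rng(0)
residuals = np.array([0.1, 0.9, 0.5, 0.5])  # illustrative per-document residuals
batch = sample_minibatch(residuals, 2, rng)
```

Documents with larger residuals are more likely to enter the minibatch, so the stochastic natural-gradient step spends its work where the model is furthest from converged.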

Cited by 12 publications (20 citation statements)
References 15 publications (18 reference statements)
“…Moreover, we show that the unified EM framework can explain recent LDA inference algorithms like VB [2], GS [5], CVB [7] and BP [8]. Experiments on four big data streams confirm that FOEM is significantly faster and more memory-efficient than the state-of-the-art online LDA algorithms including OGS [11], OVB [12], RVB [13], SOI [14] and SCVB [15]. We anticipate that the proposed FOEM can also be extended to compute ML or MAP estimates of other mixture models and latent variable models [30].…”
Section: Introduction
confidence: 69%
“…where −w, −d and −(w, d) denote all word indices except w, all document indices except d, and all word and document indices except {w, d}. After the E-step for each word, the M-step will update the sufficient statistics immediately by adding the updated responsibility µ_{w,d}(k) (13) into (14), (15) and (16).…”
Section: Online EM (OEM) for LDA
confidence: 99%
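The statement above describes an incremental E/M cycle: compute the responsibility for one word, then fold it into the sufficient statistics immediately. A minimal sketch under standard LDA notation (all names, hyperparameter values, and the specific count-plus-prior responsibility formula are illustrative assumptions, not taken from the cited paper's equations (13)–(16)):

```python
import numpy as np

K, W = 3, 5                  # number of topics, vocabulary size
alpha, beta = 0.1, 0.01      # Dirichlet hyperparameters
nkw = np.ones((K, W))        # topic-word sufficient statistics
nk = nkw.sum(axis=1)         # per-topic totals
ndk = np.ones(K)             # topic counts for one document d

def online_em_word_update(w, ndk, nkw, nk):
    # E-step: responsibility mu_{w,d}(k) for word w in document d.
    mu = (nkw[:, w] + beta) * (ndk + alpha) / (nk + W * beta)
    mu /= mu.sum()
    # M-step: fold mu into the sufficient statistics immediately,
    # rather than waiting for a full pass over the corpus.
    nkw[:, w] += mu
    nk += mu
    ndk += mu
    return mu

mu = online_em_word_update(2, ndk, nkw, nk)
```

Because the statistics are updated after every word, later words in the same document already see the refined counts, which is what distinguishes online EM from its batch counterpart.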
“…Then, we turned isLDA into a novel, easy-to-implement oLDA approach, called isoLDA, that scales well to massive and growing datasets by applying influence scheduling to randomly formed batches. Based on the results of the present paper, [8] have recently developed the first active LDA.…”
Section: Results
confidence: 99%
“…Thus, a higher communication rate leads to a larger communication cost in parallel online LDA algorithms. Therefore, it is nontrivial to reduce the communication complexity (5) of parallel online LDA algorithms [11], [12], [21], [27], [28] in order to achieve better scalability. Moreover, not all parallel batch LDA algorithms based on MPA have been proved to converge to a local optimum of LDA's objective function.…”
Section: MPA
confidence: 99%