Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing Volume 2 - EMNLP '09 2009
DOI: 10.3115/1699571.1699627
Polylingual topic models

Abstract: Topic models are a useful tool for analyzing large text collections, but have previously been applied in only monolingual, or at most bilingual, contexts. Meanwhile, massive collections of interlinked documents in dozens of languages, such as Wikipedia, are now widely available, calling for tools that can characterize content in many languages. We introduce a polylingual topic model that discovers topics aligned across multiple languages. We explore the model's characteristics using two large corpora, each wit…
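The abstract describes a model in which aligned documents across languages share a single topic mixture while each language keeps its own topic-word distributions. The following is a minimal generative sketch of that idea, not the authors' implementation; the dimensions, hyperparameter values, and function names are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (not from the paper): 2 languages, small vocabularies.
n_topics = 4
vocab_sizes = [50, 60]          # vocabulary size per language
alpha, beta = 0.5, 0.01         # symmetric Dirichlet hyperparameters

# One topic-word distribution per (topic, language) pair:
# topics are shared in index, but each language has its own word distributions.
phi = [rng.dirichlet(np.full(V, beta), size=n_topics) for V in vocab_sizes]

def generate_tuple(doc_lengths):
    """Generate one aligned document tuple sharing a single topic mixture."""
    theta = rng.dirichlet(np.full(n_topics, alpha))   # shared across languages
    docs = []
    for lang, n_words in enumerate(doc_lengths):
        z = rng.choice(n_topics, size=n_words, p=theta)          # topic per token
        w = [rng.choice(vocab_sizes[lang], p=phi[lang][t]) for t in z]
        docs.append(w)
    return theta, docs

theta, docs = generate_tuple([20, 25])
```

The shared `theta` is what aligns topics across languages: tokens in every language of the tuple are drawn from the same document-level topic mixture.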

Cited by 431 publications (463 citation statements)
References 11 publications (9 reference statements)
“…Our context-aware models are generic and allow experimentations with different models that induce latent cross-lingual semantic concepts. However, in this particular work we present results obtained by a multilingual probabilistic topic model called bilingual LDA (Mimno et al, 2009;Ni et al, 2009;De Smet and Moens, 2009). The BiLDA model is a straightforward multilingual extension of the standard LDA model (Blei et al, 2003).…”
Section: Methods
confidence: 99%
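The citation above describes BiLDA as a straightforward multilingual extension of standard LDA: the tuple-level topic mixture is shared while per-language word counts stay separate. A toy collapsed Gibbs sampler can make that structural difference concrete. This is an illustrative sketch under invented data and hyperparameters, not code from any of the cited papers.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy aligned corpus (invented): corpus[d][lang] = list of integer word ids.
corpus = [
    ([0, 1, 2, 1], [0, 0, 3]),
    ([3, 4, 4, 2], [1, 2, 2, 4]),
]
K, V = 2, 5                      # topics; vocabulary size per language
alpha, beta = 0.5, 0.01

# Count tables: document-topic counts n_dk are SHARED across languages
# (this is the BiLDA extension); word counts are kept per language.
n_dk = np.zeros((len(corpus), K))
n_kw = [np.zeros((K, V)) for _ in range(2)]
n_k = [np.zeros(K) for _ in range(2)]

# Random initialization of topic assignments.
z = []
for d, pair in enumerate(corpus):
    z_pair = []
    for lang, doc in enumerate(pair):
        zs = rng.integers(K, size=len(doc))
        for w, t in zip(doc, zs):
            n_dk[d, t] += 1
            n_kw[lang][t, w] += 1
            n_k[lang][t] += 1
        z_pair.append(zs)
    z.append(z_pair)

def gibbs_sweep():
    """One collapsed Gibbs sweep; theta counts are shared across languages."""
    for d, pair in enumerate(corpus):
        for lang, doc in enumerate(pair):
            for i, w in enumerate(doc):
                t = z[d][lang][i]
                n_dk[d, t] -= 1; n_kw[lang][t, w] -= 1; n_k[lang][t] -= 1
                p = (n_dk[d] + alpha) * (n_kw[lang][:, w] + beta) \
                    / (n_k[lang] + V * beta)
                t = rng.choice(K, p=p / p.sum())
                z[d][lang][i] = t
                n_dk[d, t] += 1; n_kw[lang][t, w] += 1; n_k[lang][t] += 1

for _ in range(50):
    gibbs_sweep()
```

Setting the number of languages to one recovers the standard LDA sampler, which is why the extension is called straightforward.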
“…For instance, one could use cross-lingual Latent Semantic Indexing (Dumais et al, 1996), probabilistic Principal Component Analysis (Tipping and Bishop, 1999), or a probabilistic interpretation of non-negative matrix factorization (Lee and Seung, 1999;Gaussier and Goutte, 2005;Ding et al, 2008) on concatenated documents in aligned document pairs. Other more recent models include matching canonical correlation analysis (Haghighi et al, 2008;Daumé III and Jagarlamudi, 2011) and multilingual probabilistic topic models (Ni et al, 2009;De Smet and Moens, 2009;Mimno et al, 2009;Boyd-Graber and Blei, 2009;Zhang et al, 2010;Fukumasu et al, 2012).…”
Section: Introduction
confidence: 99%
“…Topic models have a major benefit: they do not require documents to be sentence-aligned, which makes them a good choice for finding comparable corpora. To model bilingual topics, we used an extension of latent Dirichlet allocation (LDA) called the Polylingual Topic Model (Mimno et al, 2009). Treating each document as a bag of words, this approach consists of three main steps: first, creating sets of topics for both sides (source and target languages); then, calculating the probability of each topic in each document; and finally, computing document similarities.…”
Section: Bilingual Topic Model Module (BiTM)
confidence: 99%
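The final step of the pipeline quoted above, finding document similarities from per-document topic probabilities, can be sketched as follows. The topic counts and the choice of Hellinger-based similarity are illustrative assumptions, not details from the cited work.

```python
import numpy as np

def topic_distribution(doc_topic_counts, alpha=0.5):
    """Smoothed per-document topic distribution from raw topic counts."""
    counts = np.asarray(doc_topic_counts, dtype=float) + alpha
    return counts / counts.sum()

def hellinger_similarity(p, q):
    """1 minus the Hellinger distance; 1.0 means identical topic mixtures."""
    return 1.0 - np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

# Hypothetical topic counts for a source document and two target candidates.
src = topic_distribution([8, 1, 1, 0])
tgt_a = topic_distribution([7, 2, 1, 0])   # similar mixture
tgt_b = topic_distribution([0, 1, 2, 7])   # different mixture

assert hellinger_similarity(src, tgt_a) > hellinger_similarity(src, tgt_b)
```

Because topics are aligned across languages, the two distributions being compared live in the same topic space even though the underlying documents are in different languages.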
“…Another projection model, Latent Dirichlet Allocation (LDA), is based on the extraction of generative models from documents. Polylingual Topic Models (Mimno et al, 2009) are multilingual versions of LDA. Cross Language Explicit Semantic Analysis (CL-ESA) is the other model in the vector context approach (Potthast et al, 2008) that uses comparable Wikipedia corpora.…”
Section: Related Work
confidence: 99%
“…We employ the Polylingual Topic Model (Mimno et al, 2009), which was originally used to model corresponding documents in different languages that are topically comparable, but not parallel translations. In particular, we employ our previous work (Mason and Charniak, 2013), which extends this model to topically similar images and natural language captions.…”
Section: Joint Topic Model
confidence: 99%