Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing Volume 2 - EMNLP '09 2009
DOI: 10.3115/1699571.1699627
Polylingual topic models

Abstract: Topic models are a useful tool for analyzing large text collections, but have previously been applied in only monolingual, or at most bilingual, contexts. Meanwhile, massive collections of interlinked documents in dozens of languages, such as Wikipedia, are now widely available, calling for tools that can characterize content in many languages. We introduce a polylingual topic model that discovers topics aligned across multiple languages. We explore the model's characteristics using two large corpora, each wit…
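The abstract describes a model in which aligned documents across languages share a single topic mixture while each language keeps its own topic-word distributions. The following is a minimal generative sketch of that idea, not the authors' implementation; the dimensions, hyperparameter values, and function names are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (not from the paper): 2 languages, small vocabularies.
n_topics = 4
vocab_sizes = [50, 60]          # vocabulary size per language
alpha, beta = 0.5, 0.01         # symmetric Dirichlet hyperparameters

# One topic-word distribution per (topic, language) pair:
# topics are shared in index, but each language has its own word distributions.
phi = [rng.dirichlet(np.full(V, beta), size=n_topics) for V in vocab_sizes]

def generate_tuple(doc_lengths):
    """Generate one aligned document tuple sharing a single topic mixture."""
    theta = rng.dirichlet(np.full(n_topics, alpha))   # shared across languages
    docs = []
    for lang, n_words in enumerate(doc_lengths):
        z = rng.choice(n_topics, size=n_words, p=theta)          # topic per token
        w = [rng.choice(vocab_sizes[lang], p=phi[lang][t]) for t in z]
        docs.append(w)
    return theta, docs

theta, docs = generate_tuple([20, 25])
```

The shared `theta` is what aligns topics across languages: tokens in every language of the tuple are drawn from the same document-level topic mixture.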

Cited by 431 publications (463 citation statements)
References 11 publications (9 reference statements)
“…Our context-aware models are generic and allow experimentations with different models that induce latent cross-lingual semantic concepts. However, in this particular work we present results obtained by a multilingual probabilistic topic model called bilingual LDA (Mimno et al, 2009;Ni et al, 2009;De Smet and Moens, 2009). The BiLDA model is a straightforward multilingual extension of the standard LDA model (Blei et al, 2003).…”
Section: Methods
confidence: 99%
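The citation above describes BiLDA as a straightforward multilingual extension of standard LDA: the tuple-level topic mixture is shared while per-language word counts stay separate. A toy collapsed Gibbs sampler can make that structural difference concrete. This is an illustrative sketch under invented data and hyperparameters, not code from any of the cited papers.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy aligned corpus (invented): corpus[d][lang] = list of integer word ids.
corpus = [
    ([0, 1, 2, 1], [0, 0, 3]),
    ([3, 4, 4, 2], [1, 2, 2, 4]),
]
K, V = 2, 5                      # topics; vocabulary size per language
alpha, beta = 0.5, 0.01

# Count tables: document-topic counts n_dk are SHARED across languages
# (this is the BiLDA extension); word counts are kept per language.
n_dk = np.zeros((len(corpus), K))
n_kw = [np.zeros((K, V)) for _ in range(2)]
n_k = [np.zeros(K) for _ in range(2)]

# Random initialization of topic assignments.
z = []
for d, pair in enumerate(corpus):
    z_pair = []
    for lang, doc in enumerate(pair):
        zs = rng.integers(K, size=len(doc))
        for w, t in zip(doc, zs):
            n_dk[d, t] += 1
            n_kw[lang][t, w] += 1
            n_k[lang][t] += 1
        z_pair.append(zs)
    z.append(z_pair)

def gibbs_sweep():
    """One collapsed Gibbs sweep; theta counts are shared across languages."""
    for d, pair in enumerate(corpus):
        for lang, doc in enumerate(pair):
            for i, w in enumerate(doc):
                t = z[d][lang][i]
                n_dk[d, t] -= 1; n_kw[lang][t, w] -= 1; n_k[lang][t] -= 1
                p = (n_dk[d] + alpha) * (n_kw[lang][:, w] + beta) \
                    / (n_k[lang] + V * beta)
                t = rng.choice(K, p=p / p.sum())
                z[d][lang][i] = t
                n_dk[d, t] += 1; n_kw[lang][t, w] += 1; n_k[lang][t] += 1

for _ in range(50):
    gibbs_sweep()
```

Setting the number of languages to one recovers the standard LDA sampler, which is why the extension is called straightforward.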
“…For instance, one could use cross-lingual Latent Semantic Indexing (Dumais et al, 1996), probabilistic Principal Component Analysis (Tipping and Bishop, 1999), or a probabilistic interpretation of non-negative matrix factorization (Lee and Seung, 1999;Gaussier and Goutte, 2005;Ding et al, 2008) on concatenated documents in aligned document pairs. Other more recent models include matching canonical correlation analysis (Haghighi et al, 2008;Daumé III and Jagarlamudi, 2011) and multilingual probabilistic topic models (Ni et al, 2009;De Smet and Moens, 2009;Mimno et al, 2009;Boyd-Graber and Blei, 2009;Zhang et al, 2010;Fukumasu et al, 2012).…”
Section: Introduction
confidence: 99%
“…Topic models have a major benefit: they do not require documents to be sentence-aligned, which makes them a good choice for finding comparable corpora. To model bilingual topics, we used an extension of latent Dirichlet allocation (LDA) called the Polylingual Topic Model (Mimno et al, 2009). Treating each document as a bag of words, this approach consists of three main steps: first, creating sets of topics for both sides (source and target languages); then, calculating the probability of each topic in each document; and finally, computing document similarities.…”
Section: Bilingual Topic Model Module (BiTM)
confidence: 99%
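The final step of the pipeline quoted above, finding document similarities from per-document topic probabilities, can be sketched as follows. The topic counts and the choice of Hellinger-based similarity are illustrative assumptions, not details from the cited work.

```python
import numpy as np

def topic_distribution(doc_topic_counts, alpha=0.5):
    """Smoothed per-document topic distribution from raw topic counts."""
    counts = np.asarray(doc_topic_counts, dtype=float) + alpha
    return counts / counts.sum()

def hellinger_similarity(p, q):
    """1 minus the Hellinger distance; 1.0 means identical topic mixtures."""
    return 1.0 - np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

# Hypothetical topic counts for a source document and two target candidates.
src = topic_distribution([8, 1, 1, 0])
tgt_a = topic_distribution([7, 2, 1, 0])   # similar mixture
tgt_b = topic_distribution([0, 1, 2, 7])   # different mixture

assert hellinger_similarity(src, tgt_a) > hellinger_similarity(src, tgt_b)
```

Because topics are aligned across languages, the two distributions being compared live in the same topic space even though the underlying documents are in different languages.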
“…Another projection model, Latent Dirichlet Allocation (LDA), is based on the extraction of generative models from documents. Polylingual Topic Models (Mimno et al, 2009) are multilingual versions of LDA. Cross Language Explicit Semantic Analysis (CL-ESA) is the other model in the vector context approach (Potthast et al, 2008) that uses comparable Wikipedia corpora.…”
Section: Related Work
confidence: 99%
“…We employ the Polylingual Topic Model (Mimno et al, 2009), which was originally used to model corresponding documents in different languages that are topically comparable, but not parallel translations. In particular, we employ our previous work (Mason and Charniak, 2013), which extends this model to topically similar images and natural language captions.…”
Section: Joint Topic Model
confidence: 99%