2015
DOI: 10.1007/s10618-015-0442-x
C-BiLDA: extracting cross-lingual topics from non-parallel texts by distinguishing shared from unshared content

Abstract: We study the problem of extracting cross-lingual topics from non-parallel multilingual text datasets with partially overlapping thematic content (e.g., aligned Wikipedia articles in two different languages). To this end, we develop a new bilingual probabilistic topic model called comparable bilingual latent Dirichlet allocation (C-BiLDA), which is able to deal with such comparable data, and, unlike the standard bilingual LDA model (BiLDA), does not assume the availability of document pairs with identical topic…
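To make the contrast concrete, here is a minimal generative-story sketch in Python of standard BiLDA, the assumption that C-BiLDA relaxes. All names and hyperparameter values are illustrative assumptions, not the authors' implementation: BiLDA draws a single topic distribution theta per aligned document pair, so both documents are forced to have identical topic proportions, whereas C-BiLDA lets a pair share only part of its thematic content.

import numpy as np

rng = np.random.default_rng(0)

K = 20                   # number of shared cross-lingual topics (illustrative)
V_e, V_f = 5000, 6000    # vocabulary sizes of the two languages (assumed)
alpha, beta = 0.1, 0.01  # symmetric Dirichlet hyperparameters (assumed)

# Language-specific topic-word distributions defined over one shared topic space.
phi_e = rng.dirichlet([beta] * V_e, size=K)   # shape (K, V_e)
phi_f = rng.dirichlet([beta] * V_f, size=K)   # shape (K, V_f)

def generate_bilda_pair(n_e, n_f):
    """Standard BiLDA: a single theta per aligned document pair, i.e. both
    documents in the pair get identical topic distributions."""
    theta = rng.dirichlet([alpha] * K)
    doc_e = [rng.choice(V_e, p=phi_e[z]) for z in rng.choice(K, size=n_e, p=theta)]
    doc_f = [rng.choice(V_f, p=phi_f[z]) for z in rng.choice(K, size=n_f, p=theta)]
    return doc_e, doc_f

pair = generate_bilda_pair(n_e=120, n_f=150)

C-BiLDA replaces this single shared theta with a mechanism that distinguishes shared from unshared (document-specific) content; the exact generative process is specified in the paper.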

Cited by 13 publications (10 citation statements). References 40 publications (55 reference statements).
“…In consequence, this also influences topic distributions of related words not occurring in the dictionary. Another group of models utilizes alignments at the document level (Mimno, Wallach, Naradowsky, Smith, & McCallum, 2009;Platt, Toutanova, & Yih, 2010;Vulić, De Smet, & Moens, 2011;Fukumasu, Eguchi, & Xing, 2012;Heyman, Vulić, & Moens, 2016) to induce shared topical spaces. The very same level of supervision (i.e., document alignments) is used by several cross-lingual word embedding models, surveyed in Section 8.…”
Section: A Brief History of Cross-lingual Word Representations (mentioning; confidence: 99%)
“…that are annotated on the Reuters documents, for example: when an English document and a Spanish document are both annotated with the same global label they are considered to have comparable content and are added as a document pair to the comparable corpus. We analysed the resulting dataset with multilingual probabilistic topic models: Bilingual Latent Dirichlet Allocation (BiLDA) [53] and Comparable Bilingual Latent Dirichlet Allocation (C-BiLDA) [54]. We found that, although the C-BiLDA model could uncover some interesting cross-lingual topics (clusters of related words), the dataset was not well-suited for inducing translations as the domain was too broad and the comparability across languages too low.…”
Section: Terminology Extraction from Comparable Text (mentioning; confidence: 99%)
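A minimal sketch of the label-based pairing heuristic described in this excerpt, with hypothetical field names (not the cited authors' code): English and Spanish Reuters documents are paired whenever they carry the same global topic label, and each pair is added to the comparable corpus.

from collections import defaultdict

def build_comparable_corpus(english_docs, spanish_docs):
    """Pair documents across languages that share a global label.
    Each document is assumed to be a dict: {"label": str, "text": str}."""
    spanish_by_label = defaultdict(list)
    for doc in spanish_docs:
        spanish_by_label[doc["label"]].append(doc)
    pairs = []
    for en_doc in english_docs:
        for es_doc in spanish_by_label.get(en_doc["label"], []):
            pairs.append((en_doc["text"], es_doc["text"]))
    return pairs

The excerpt's observation that the domain was too broad and cross-lingual comparability too low corresponds to pairs whose texts overlap only in very coarse themes.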
“…All three models learn bilingual word representations from subject-aligned document pairs only. Multilingual topic modeling has been shown to be a robust framework for learning bilingual representations from such non-parallel data: BiLDA has been successfully applied to BLI [56], and C-BiLDA is a more recent extension to BiLDA that learns higher-quality representations when the aligned document pairs exhibit a lower degree of parallelism [54]. BWESG is a simple but effective extension to continuous skip-gram.…”
Section: Comparison of Weakly-Supervised Word-Level BLI Models (mentioning; confidence: 99%)
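As a generic illustration of how a topic model such as BiLDA or C-BiLDA can be used for bilingual lexicon induction (not necessarily the exact scoring used in [54] or [56]): represent each word by its distribution over the shared topics and rank candidate translations by the similarity of these vectors.

import numpy as np

def word_topic_vectors(phi):
    """Turn a K x V topic-word matrix into per-word vectors P(topic | word)
    by normalizing each column over the shared topic space."""
    col_sums = np.maximum(phi.sum(axis=0, keepdims=True), 1e-12)
    return (phi / col_sums).T            # shape (V, K)

def rank_translations(word_idx, phi_src, phi_tgt, top_n=5):
    """Rank target-language words by cosine similarity of their topic vectors."""
    src = word_topic_vectors(phi_src)[word_idx]
    tgt = word_topic_vectors(phi_tgt)
    scores = tgt @ src / (np.linalg.norm(tgt, axis=1) * np.linalg.norm(src) + 1e-12)
    return np.argsort(-scores)[:top_n]

Because both languages' topic-word matrices are defined over the same shared topic space, the resulting word vectors are directly comparable across languages.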
“…Most multilingual topic models are generative admixture models in which the conditional probabilities can be factorized into different levels, thus the KL-divergence term in Theorem 3 can be decomposed and analyzed in the same way as in this section for models that have transfer at other levels, such as , Heyman et al. (2016), and Hu et al. (2014). For example, if a model has word-level transfer, i.e., the model assumes that word translations share the same distributions, we have a KL-divergence term as,…”
Section: Multilevel Transfer (mentioning; confidence: 99%)
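The quoted sentence is cut off before the equation it introduces. Purely as an illustration of the kind of term meant (not the cited paper's exact expression), a word-level transfer assumption that a word and its translation share a topic distribution would contribute a term of the form

\[
  \sum_{(w, w') \in \mathcal{T}} D_{\mathrm{KL}}\bigl( p(z \mid w) \,\big\|\, p(z \mid w') \bigr),
\]

where \(\mathcal{T}\) is the set of translation pairs and \(p(z \mid \cdot)\) denotes a word's distribution over topics.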