Findings of the Association for Computational Linguistics: EMNLP 2020
DOI: 10.18653/v1/2020.findings-emnlp.147
SimAlign: High Quality Word Alignments Without Parallel Training Data Using Static and Contextualized Embeddings

Abstract: Word alignments are useful for tasks like statistical and neural machine translation (NMT) and cross-lingual annotation projection. Statistical word aligners perform well, as do methods that extract alignments jointly with translations in NMT. However, most approaches require parallel training data, and quality decreases as less training data is available. We propose word alignment methods that require no parallel data. The key idea is to leverage multilingual word embeddings, both static and contextualized, f…

Cited by 45 publications (41 citation statements). References 50 publications.
“…It can be observed that XLM-ALIGN consistently improves the results over XLM-R-base across these layers. Moreover, it shows a parabolic trend across the layers of XLM-R-base, which is consistent with the results in (Jalili Sabet et al., 2020). In contrast to XLM-R-base, XLM-ALIGN alleviates this trend and greatly reduces AER in the last few layers.…”
Section: Word Alignment (supporting)
confidence: 89%
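The statement above evaluates alignments with AER (Alignment Error Rate). As a minimal sketch, the standard AER definition over sure links S, possible links P (with S ⊆ P), and predicted links A can be computed as follows; the word-index pairs below are toy data, not from the paper:

```python
# Alignment Error Rate: AER = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|).
# Lower is better; 0.0 means the prediction covers all sure links and
# contains no links outside the possible set.

def aer(predicted, sure, possible):
    """Compute AER from sets of (source_idx, target_idx) link pairs."""
    a, s = set(predicted), set(sure)
    p = set(possible) | s  # sure links are always also possible
    return 1.0 - (len(a & s) + len(a & p)) / (len(a) + len(s))

# Toy example: one predicted link (2, 3) is outside the possible set.
gold_sure = {(0, 0), (1, 2)}
gold_possible = {(0, 0), (1, 2), (2, 1)}
pred = {(0, 0), (1, 2), (2, 3)}
print(round(aer(pred, gold_sure, gold_possible), 2))  # → 0.2
```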
“…We would like to point out that, parallel to the present work, Sabet et al. (2020) also introduced the first two of the four methods. Since they aim to extract an explicit alignment between source and target, they do not construct a score for a sentence pair and do not consider the use in a data filtering task.…”
Section: Source ↔ Target Embedding Similarity (mentioning)
confidence: 99%
“…For a long time, IBM-model-based frameworks like GIZA++ (Och and Ney, 2003) or fast-align (Dyer et al., 2013) produced the best word alignments. However, recently Sabet et al. (2020) report equally good results by using a word similarity matrix calculated from cross-lingual word embeddings.…”
Section: Related Work (mentioning)
confidence: 99%
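The similarity-matrix approach mentioned above can be sketched as follows. This is a hedged illustration, not the paper's implementation: the embeddings here are random stand-ins for real cross-lingual vectors, and the link-extraction rule (keep a pair only if it is the mutual argmax of its row and column) is one simple way to turn a similarity matrix into an alignment:

```python
import numpy as np

# Stand-in "cross-lingual" embeddings: 3 source and 4 target tokens, 8-dim.
rng = np.random.default_rng(0)
src = rng.normal(size=(3, 8))
tgt = rng.normal(size=(4, 8))

# Cosine similarity matrix between every source/target token pair.
src_n = src / np.linalg.norm(src, axis=1, keepdims=True)
tgt_n = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
sim = src_n @ tgt_n.T  # shape (3, 4)

# Keep link (i, j) only if it is the best match in both directions.
row_best = sim.argmax(axis=1)  # best target for each source token
col_best = sim.argmax(axis=0)  # best source for each target token
alignment = {(i, j) for i, j in enumerate(row_best) if col_best[j] == i}
print(sorted(alignment))
```

The global maximum of the matrix is always a mutual argmax, so at least one link is extracted; unmatched tokens are simply left unaligned, which is how this family of methods handles null alignments.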
“…Statistical models such as IBM models (Brown et al., 1993), GIZA++ (Och and Ney, 2003), fast-align (Dyer et al., 2013) and Eflomal (Östling and Tiedemann, 2016b) are widely used. Recently, neural models were proposed, such as SimAlign (Jalili Sabet et al., 2020), Awesome-align (Dou and Neubig, 2021), and methods that are based on neural machine translation (Garg et al., 2019; Zenkel et al., 2020). We use Eflomal and SimAlign for generating alignments.…”
Section: Related Work (mentioning)
confidence: 99%