Multilingual pretraining approaches in Neural Machine Translation (NMT) have shown that training models to denoise synthetic code-switched data can yield impressive performance gains, owing to better multilingual semantic representations and transfer learning. However, these approaches generate the synthetic code-switched data using non-contextual, one-to-one word translations obtained from lexicons, which can introduce significant noise in a variety of cases, including poor handling of polysemes and multi-word expressions, violations of linguistic agreement, and an inability to scale to agglutinative languages. To overcome these limitations, we propose an approach called Contextual Code-Switching (CCS), in which contextual, many-to-many word translations are generated using a `base' NMT model. We conduct experiments on three different language families (Romance, Uralic, and Indo-Aryan) and show significant improvements (by up to 5.5 spBLEU points) over previous lexicon-based state-of-the-art approaches. We also observe that small CCS models can perform comparably to or better than massive models such as mBART50 and mRASP2, depending on the amount of data provided. Lastly, through ablation studies, we highlight the major code-switching aspects (including context, many-to-many substitutions, the number of code-switching languages, etc.) that contribute to the enhanced pretraining of multilingual NMT models.
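
To make the general idea concrete, the sketch below shows one possible way to produce a contextual, many-to-many code-switched sentence with an off-the-shelf NMT model. It is only an illustrative approximation of the abstract's description, not the exact pipeline of this work: the Helsinki-NLP model name, the random span-selection heuristic, and the `contextual_code_switch` helper are assumptions introduced purely for illustration.

```python
# Illustrative sketch (assumptions noted above): translate a multi-word span
# with a "base" NMT model and splice the translation back into the sentence.
# Translating a whole span at once permits many-to-many substitutions and
# conditions the translation on the span's own words, unlike a one-to-one
# lexicon lookup.
import random
from transformers import MarianMTModel, MarianTokenizer

MODEL_NAME = "Helsinki-NLP/opus-mt-en-fr"  # assumed English->French base model
tokenizer = MarianTokenizer.from_pretrained(MODEL_NAME)
model = MarianMTModel.from_pretrained(MODEL_NAME)


def translate_span(text: str) -> str:
    """Translate a short text span with the base NMT model (greedy decoding)."""
    batch = tokenizer([text], return_tensors="pt")
    generated = model.generate(**batch, max_new_tokens=32)
    return tokenizer.decode(generated[0], skip_special_tokens=True)


def contextual_code_switch(sentence: str, max_span: int = 3) -> str:
    """Replace one randomly chosen word span with its NMT translation.

    The output span may contain more or fewer words than the input span,
    giving a many-to-many substitution rather than a word-for-word swap.
    """
    words = sentence.split()
    span_len = random.randint(1, min(max_span, len(words)))
    start = random.randint(0, len(words) - span_len)
    span = " ".join(words[start:start + span_len])
    switched = translate_span(span)
    return " ".join(words[:start] + [switched] + words[start + span_len:])


if __name__ == "__main__":
    print(contextual_code_switch("the committee approved the new budget today"))
```

Note that this toy version only conditions on the words inside the selected span; a fuller implementation of the contextual substitution described above would also take the surrounding sentence into account when choosing the translation.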