2020
DOI: 10.48550/arxiv.2012.11657
Preprint

Subword Sampling for Low Resource Word Alignment

Ehsaneddin Asgari,
Masoud Jalili Sabet,
Philipp Dufter
et al.

Abstract: Annotation projection is an important area in NLP that can greatly contribute to creating language resources for low-resource languages. Word alignment plays a key role in this setting. However, most existing word alignment methods are designed for the high-resource setting of machine translation, where millions of parallel sentences are available. This amount drops to a few thousand sentences for low-resource languages, at which point the established IBM models fail. In this paper, we propo…

Cited by 3 publications (3 citation statements)
References 18 publications
“…They also compare them with respect to unaligned words, rare words, Part-of-Speech (PoS) tags, and distortion errors. Asgari et al. (2020) study word alignment results when using subword-level tokenization and show improved performance with respect to the word level. analyze the performance of word aligners regarding different PoS for English/German and show that Eflomal has low performance when aligning links with high distortion.…”
Section: Word Alignment Analysis
confidence: 99%
“…Awesome-align uses representations from mBERT (Devlin et al., 2019), so it could scale to the languages the latter is pretrained on. For other languages, such as very low-resource pairs, it could be worth exploring low-resource word aligners (Asgari et al., 2020; Poerner et al., 2018), though we leave this exploration to future work. As for the 'base' model, we could use models trained from scratch as a viable alternative (see Table 4) and potentially obtain comparable performance.…”
Section: Resource Dependencies
confidence: 99%
“…Previous work towards this goal includes algorithms which offer robustness within an existing subword vocabulary (Provilkov et al., 2020; He et al., 2020; Hiraoka, 2022), necessitating modification of training, inference, or both in the context of LLMs. Others have considered tuning the size of a subword vocabulary (Salesky et al., 2020), or selecting from an enlarged set of possible segmentations (Asgari et al., 2020), to optimize performance on downstream tasks.…”
Section: Related Work
confidence: 99%
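To make the idea of "selecting from an enlarged set of possible segmentations" concrete, the following is a minimal sketch of merge-dropout subword sampling in the style of BPE-dropout (Provilkov et al., 2020). It is not the paper's actual algorithm; the merge table and the `segment` function are hypothetical, chosen only to show how skipping merges at random yields multiple candidate segmentations of the same word.

```python
import random

# Hypothetical ordered BPE merge table, for illustration only.
MERGES = [("l", "o"), ("lo", "w"), ("e", "r"), ("n", "e"), ("ne", "w")]

def segment(word, dropout=0.0, rng=random):
    """Apply BPE merges to `word`, skipping each applicable merge
    with probability `dropout`. dropout=0.0 gives the deterministic
    segmentation; dropout>0.0 samples a random segmentation."""
    tokens = list(word)
    for left, right in MERGES:
        merged, i = [], 0
        while i < len(tokens):
            if (i + 1 < len(tokens)
                    and tokens[i] == left and tokens[i + 1] == right
                    and rng.random() >= dropout):
                merged.append(left + right)  # merge the adjacent pair
                i += 2
            else:
                merged.append(tokens[i])     # keep the token as-is
                i += 1
        tokens = merged
    return tokens

# Deterministic segmentations:
print(segment("lower"))  # ['low', 'er']
print(segment("newer"))  # ['new', 'er']

# Sampling several segmentations of one word enlarges the pool of
# subword units available to a downstream aligner.
rng = random.Random(0)
samples = {tuple(segment("lower", dropout=0.5, rng=rng)) for _ in range(20)}
```

Drawing many samples per word (as in the loop above) and aligning at the subword level is the general mechanism the citing passage refers to; the specific sampling distribution and selection criterion in the paper may differ.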