2020
DOI: 10.48550/arxiv.2012.11657
Preprint

Subword Sampling for Low Resource Word Alignment

Ehsaneddin Asgari,
Masoud Jalili Sabet,
Philipp Dufter
et al.

Abstract: Annotation projection is an important area in NLP that can greatly contribute to creating language resources for low-resource languages. Word alignment plays a key role in this setting. However, most existing word alignment methods are designed for the high-resource setting of machine translation, where millions of parallel sentences are available. This amount drops to a few thousand sentences for low-resource languages, at which point the established IBM models fail. In this paper, we propo…

Cited by 3 publications (3 citation statements)
References 18 publications
“…They also compare them with respect to unaligned words, rare words, Part-of-Speech (PoS) tags, and distortion errors. Asgari et al. (2020) study word alignment results when using subword-level tokenization and show improved performance with respect to the word level. analyze the performance of word aligners regarding different PoS for English/German and show that Eflomal has low performance when aligning links with high distortion.…”
Section: Word Alignment Analysis
confidence: 99%
“…Awesome-align uses representations from mBERT (Devlin et al., 2019), so it could scale to the languages the latter is pretrained on. For other languages, such as very low-resource pairs, it could be worth exploring low-resource word aligners (Asgari et al., 2020; Poerner et al., 2018), though we leave this exploration to future work. As for the 'base' model, we could use models trained from scratch as a viable alternative (see Table 4) and potentially obtain comparable performance.…”
Section: Resource Dependencies
confidence: 99%
“…Previous work towards this goal includes algorithms which offer robustness within an existing subword vocabulary (Provilkov et al., 2020; He et al., 2020; Hiraoka, 2022), necessitating modification of training, inference, or both in the context of LLMs. Others have considered tuning the size of a subword vocabulary (Salesky et al., 2020), or selecting from an enlarged set of possible segmentations (Asgari et al., 2020), to optimize performance on downstream tasks.…”
Section: Related Work
confidence: 99%
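To make the idea of "selecting from an enlarged set of possible segmentations" concrete, the following is a minimal sketch of merge-dropout subword sampling in the style of BPE-dropout (Provilkov et al., 2020). It is not the paper's actual algorithm; the merge table and the `segment` function are hypothetical, chosen only to show how skipping merges at random yields multiple candidate segmentations of the same word.

```python
import random

# Hypothetical ordered BPE merge table, for illustration only.
MERGES = [("l", "o"), ("lo", "w"), ("e", "r"), ("n", "e"), ("ne", "w")]

def segment(word, dropout=0.0, rng=random):
    """Apply BPE merges to `word`, skipping each applicable merge
    with probability `dropout`. dropout=0.0 gives the deterministic
    segmentation; dropout>0.0 samples a random segmentation."""
    tokens = list(word)
    for left, right in MERGES:
        merged, i = [], 0
        while i < len(tokens):
            if (i + 1 < len(tokens)
                    and tokens[i] == left and tokens[i + 1] == right
                    and rng.random() >= dropout):
                merged.append(left + right)  # merge the adjacent pair
                i += 2
            else:
                merged.append(tokens[i])     # keep the token as-is
                i += 1
        tokens = merged
    return tokens

# Deterministic segmentations:
print(segment("lower"))  # ['low', 'er']
print(segment("newer"))  # ['new', 'er']

# Sampling several segmentations of one word enlarges the pool of
# subword units available to a downstream aligner.
rng = random.Random(0)
samples = {tuple(segment("lower", dropout=0.5, rng=rng)) for _ in range(20)}
```

Drawing many samples per word (as in the loop above) and aligning at the subword level is the general mechanism the citing passage refers to; the specific sampling distribution and selection criterion in the paper may differ.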