2018
DOI: 10.48550/arxiv.1811.00066
Preprint

Aligning Very Small Parallel Corpora Using Cross-Lingual Word Embeddings and a Monogamy Objective

Cited by 2 publications (3 citation statements)
References 0 publications
“…Awesome-align uses representations from mBERT (Devlin et al., 2019), so it could scale to the languages the latter is pretrained on. For other languages, such as very low-resource pairs, it could be worth exploring low-resource word aligners (Asgari et al., 2020; Poerner et al., 2018), though we leave that exploration to future work. As for the 'base' model, we could use models trained from scratch as a viable alternative (see Table 4) and potentially obtain comparable performance.…”
Section: Resource Dependencies (mentioning)
confidence: 99%
“…The most popular dataset for low-resource alignment is the Bible Parallel Corpus, which covers a large number (1000+) of languages but is characteristically low-resource, i.e., it has little text per language (Mayer and Cysouw, 2014). Some recent work touched on this problem using unsupervised cross-lingual embeddings and a monogamy objective (Poerner et al., 2018). However, this method could not improve on the fast-align results for parallel corpora containing more than 250 sentences.…”
Section: Related Work (mentioning)
confidence: 99%
“…One of the main challenges for annotation projection is that corpora are often relatively small for low-resource languages. Existing IBM-based alignment models work well in high-resource settings, but they fail in the low-resource case (Poerner et al., 2018). The most popular dataset for low-resource alignment, the Bible Parallel Corpus, covers a large number (1000+) of languages but is characteristically low-resource, i.e., it has only around 5,000-10,000 parallel sentences per language pair.…”
Section: Introduction (mentioning)
confidence: 99%
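
To make the low-resource failure mode cited above concrete, here is a minimal sketch of an IBM Model 1 baseline using NLTK's implementation; the three-sentence German-English bitext is a hypothetical toy example, and with so few sentence pairs the EM estimates are noisy, which is precisely the regime where the citing papers report IBM-based aligners breaking down.

```python
# Minimal sketch: IBM Model 1 on a toy bitext (hypothetical data).
# With only a handful of sentence pairs, the learned translation
# probabilities and alignments are unreliable -- the low-resource
# failure mode discussed in the citation statements above.
from nltk.translate import AlignedSent, IBMModel1

bitext = [
    AlignedSent(['klein', 'ist', 'das', 'haus'],
                ['the', 'house', 'is', 'small']),
    AlignedSent(['das', 'haus', 'ist', 'gross'],
                ['the', 'house', 'is', 'big']),
    AlignedSent(['das', 'buch', 'ist', 'klein'],
                ['the', 'book', 'is', 'small']),
]

# Train for 5 EM iterations; alignments are written back onto `bitext`.
ibm1 = IBMModel1(bitext, 5)

print(ibm1.translation_table['haus']['house'])  # lexical probability
print(bitext[0].alignment)                      # induced word alignment
```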