Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2015
DOI: 10.3115/v1/n15-1129

Multi-Task Word Alignment Triangulation for Low-Resource Languages

Abstract: We present a multi-task learning approach that jointly trains three word alignment models over disjoint bitexts of three languages: source, target and pivot. Our approach builds upon model triangulation, following Wang et al., which approximates a source-target model by combining source-pivot and pivot-target models. We develop a MAP-EM algorithm that uses triangulation as a prior, and show how to extend it to a multi-task setting. On a low-resource Czech-English corpus, using French as the pivot, our multi-ta…
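The abstract's core idea — approximating a source-target translation table by marginalizing over a pivot language — can be sketched as follows. This is an illustrative toy, not the paper's implementation; the function name, dictionary representation, and renormalization step are assumptions.

```python
def triangulate(src_pivot, pivot_tgt):
    """Approximate t(t|s) by combining t(p|s) and t(t|p) over pivot words p.

    src_pivot: dict mapping source word -> {pivot word: probability}
    pivot_tgt: dict mapping pivot word -> {target word: probability}
    Returns: dict mapping source word -> {target word: probability}
    """
    src_tgt = {}
    for s, pivot_dist in src_pivot.items():
        dist = {}
        # marginalize over the pivot: t(t|s) ~ sum_p t(p|s) * t(t|p)
        for p, prob_sp in pivot_dist.items():
            for t, prob_pt in pivot_tgt.get(p, {}).items():
                dist[t] = dist.get(t, 0.0) + prob_sp * prob_pt
        # renormalize in case some pivot words have no target entries
        z = sum(dist.values())
        if z > 0:
            dist = {t: v / z for t, v in dist.items()}
        src_tgt[s] = dist
    return src_tgt


# Toy example mirroring the paper's setting (Czech -> French -> English):
cs_fr = {"pes": {"chien": 0.9, "chat": 0.1}}
fr_en = {"chien": {"dog": 1.0}, "chat": {"cat": 1.0}}
cs_en = triangulate(cs_fr, fr_en)
# cs_en["pes"] is {"dog": 0.9, "cat": 0.1}
```

The triangulated table is then used not as the final model but as a prior inside MAP-EM, so that direct (if scarce) source-target evidence can still override the pivot-based approximation.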

Cited by 6 publications (6 citation statements). References 6 publications.
“…Our work is a natural extension of previous word alignment work. A robust alignment tool for low-resource languages benefits MT systems (Xiang et al., 2010a; Levinboim and Chiang, 2015; Beloucif et al., 2016a; Nagata et al., 2020) or speech recognition, especially if sentence-level alignment tools like LASER (Artetxe and Schwenk, 2019; Chaudhary et al., 2019) do not cover all languages, so one may need to fall back to word-level alignment heuristics to inform sentence-alignment models like Hunalign (Varga et al., 2007).…”
Section: English-German Test Set
confidence: 99%
“…As awesome-align achieves the overall highest performance, we choose to focus on awesome-align in this work. Some works involve improving word-level alignment for low-resource languages, such as utilizing semantic information (Beloucif et al., 2016b; Pourdamghani et al., 2018), multi-task learning (Levinboim and Chiang, 2015), and combining complementary word alignments (Xiang et al., 2010b). None of the previous work, though, to our knowledge, tackles the problem of aligning data with OCR-like noise on one or both sides.…”
Section: English-German Test Set
confidence: 99%
“…Our work is a natural extension of previous word alignment work. A robust alignment tool for low-resource languages benefits MT systems (Xiang et al., 2010a; Levinboim and Chiang, 2015; Beloucif et al., 2016a; Nagata et al., 2020) or speech recognition, especially if sentence-level alignment tools like LASER (Artetxe and Schwenk, 2019; Chaudhary et al., 2019) do not cover all languages, so one may need to fall back to word-level alignment heuristics to inform sentence-alignment models like Hunalign (Varga et al., 2007).…”
Section: Related Work
confidence: 99%
“…A variant of this strategy is to view the source parameter values as priors for the target model, an idea that has been used repeatedly in the context of domain adaptation. It has notably been used for transferring parsers (Cohen and Smith, 2009; Burkett et al., 2010) and, more recently, to also transfer alignment models (Levinboim and Chiang, 2015).…”
Section: Transfer in Parameter Space
confidence: 99%
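The "transferred parameters as a prior" idea in the citation above can be made concrete with a minimal MAP M-step sketch: expected counts from EM on the target bitext are smoothed toward a prior distribution (e.g., a triangulated table) via a Dirichlet-style pseudo-count. The function name, `alpha` value, and data here are illustrative assumptions, not the paper's exact formulation.

```python
def map_m_step(expected_counts, prior, alpha=1.0):
    """MAP re-estimation for one source word's translation distribution.

    expected_counts: dict of {target word: expected count} from the E-step
    prior: dict of {target word: prior probability} (e.g., triangulated)
    alpha: prior strength; the estimate is proportional to
           expected_counts + alpha * prior
    """
    words = set(expected_counts) | set(prior)
    unnorm = {w: expected_counts.get(w, 0.0) + alpha * prior.get(w, 0.0)
              for w in words}
    z = sum(unnorm.values())
    return {w: v / z for w, v in unnorm.items()}


# Toy example: sparse target-side counts pulled toward the prior.
counts = {"dog": 3.0, "cat": 1.0}
prior = {"dog": 0.9, "cat": 0.1}
est = map_m_step(counts, prior, alpha=2.0)
# dog: (3 + 2*0.9) / 6 = 0.8; cat: (1 + 2*0.1) / 6 = 0.2
```

With abundant data the counts dominate and the prior's influence vanishes; with scarce data the estimate stays close to the transferred prior, which is exactly the low-resource behavior the cited work exploits.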
“…In this study, we explore ways to overcome this paradox and consider techniques for transferring alignment models or annotations across language pairs, a task that has hardly been addressed in the literature (see, however, Wang et al., 2006; Levinboim and Chiang, 2015). Based on a high-level typology of cross-lingual transfer methodologies (§ 2), our contribution is to formalize realistic scenarios (defined in § 3) as well as some basic methodologies for projecting knowledge about bilingual alignments cross-linguistically (§ 4).…”
Section: Introduction
confidence: 99%