Misalignment Detection for Web-Scraped Corpora: A Supervised Regression Approach

Defauw, Arne; Szoc, Sara; Bardadym, Anna; Brabers, Joris; Everaert, Frederic; Mijic, Roko; Scholte, Kim; Vanallemeersch, Tom; Winckel, Koen Van; Bogaert, Joachim Van den

doi:10.3390/informatics6030035

Cited by 1 publication

(1 citation statement)

References 20 publications

“…Crosslingual word embeddings have been used to calculate distance between equivalences in different languages (Luong et al, 2015; Artetxe et al, 2016). Defauw et al (2019) treat filtering as a supervised regression problem and show that Levenshtein distance (Levenshtein, 1966) between the target and MT-translated source, as well as cosine distance between sentence embeddings of the source and target, are important features. While they use InferSent (Conneau et al, 2017), BERT (Devlin et al, 2019) has recently been employed for calculating crosslingual semantic textual similarity to detect misalignment with good results (Lo and Simard, 2019).…”

Section: Filtering (mentioning)

confidence: 99%
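
The two features named in the citation statement above can be illustrated with a minimal sketch. It is not the implementation from Defauw et al (2019): the function names, the length normalisation of the edit distance, and the toy embedding vectors are assumptions made for illustration, and a real pipeline would obtain the MT output and the sentence embeddings (for example from InferSent) from external systems.

```python
import math


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance; insert, delete and substitute each cost 1."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def cosine_distance(u, v) -> float:
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def misalignment_features(target, mt_of_source, source_vec, target_vec):
    """Hypothetical feature extractor for one sentence pair.

    Dividing by the longer string length is an assumption made here so
    that long and short pairs are comparable; the source does not say
    how (or whether) the distance is normalised.
    """
    longest = max(len(target), len(mt_of_source), 1)
    return {
        "levenshtein_norm": levenshtein(target, mt_of_source) / longest,
        "cosine_distance": cosine_distance(source_vec, target_vec),
    }


# Toy usage: the strings and the 3-dimensional vectors are made up.
features = misalignment_features(
    target="the cat sits on the mat",
    mt_of_source="the cat sat on the mat",
    source_vec=[0.2, 0.7, 0.1],
    target_vec=[0.25, 0.65, 0.1],
)
print(features)
```

In a supervised regression setup such as the one the statement describes, values like these would be among the inputs to a regressor trained on pairs labelled for alignment quality; the choice of regressor is left open here.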

Effectively Aligning and Filtering Parallel Corpora under Sparse Data Conditions

Steingrímsson, Loftsson, Way

2020

Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop

Parallel corpora are key to developing good machine translation systems. However, abundant parallel data are hard to come by, especially for languages with a low number of speakers. When rich morphology exacerbates the data sparsity problem, it is imperative to have accurate alignment and filtering methods that can help make the most of what is available by maximising the number of correctly translated segments in a corpus and minimising noise by removing incorrect translations and segments containing extraneous data. This paper sets out a research plan for improving alignment and filtering methods for parallel texts in low-resource settings. We propose an effective unsupervised alignment method to tackle the alignment problem. Moreover, we propose a strategy to supplement state-of-the-art models with automatically extracted information using basic NLP tools to effectively handle rich morphology.
