2019
DOI: 10.3390/informatics6030035

Misalignment Detection for Web-Scraped Corpora: A Supervised Regression Approach

Abstract: To build state-of-the-art Neural Machine Translation (NMT) systems, high-quality parallel sentences are needed. Typically, large amounts of data are scraped from multilingual web sites and aligned into datasets for training. Many tools exist for automatic alignment of such datasets. However, the quality of the resulting aligned corpus can be disappointing. In this paper, we present a tool for automatic misalignment detection (MAD). We treated the task of determining whether a pair of aligned sentences constitu…

Citations: cited by 1 publication (1 citation statement)
References: 20 publications
“…Crosslingual word embeddings have been used to calculate distance between equivalences in different languages (Luong et al, 2015; Artetxe et al, 2016). Defauw et al (2019) treat filtering as a supervised regression problem and show that Levenshtein distance (Levenshtein, 1966) between the target and MT-translated source, as well as cosine distance between sentence embeddings of the source and target, are important features. While they use InferSent (Conneau et al, 2017), BERT (Devlin et al, 2019) has recently been employed for calculating crosslingual semantic textual similarity to detect misalignment with good results (Lo and Simard, 2019).…”
Section: Filtering
Citation type: mentioning
confidence: 99%
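
The two features named in the citation statement above can be illustrated with a short sketch. This is not the authors' implementation: the function names, the length normalisation of the edit distance, and the toy vectors are all assumptions made for illustration. In practice the vectors would come from a sentence encoder (the statement mentions InferSent), and the second string would come from a machine translation system.

```python
import math
from typing import Dict, Sequence


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 - cosine similarity; raises on zero-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def pair_features(target: str,
                  mt_of_source: str,
                  source_vec: Sequence[float],
                  target_vec: Sequence[float]) -> Dict[str, float]:
    """Hypothetical feature extractor for one aligned sentence pair.

    `mt_of_source` is the machine translation of the source sentence;
    the vectors are placeholder sentence embeddings. Dividing by the
    longer string length is an illustrative choice, not taken from the paper.
    """
    edit = levenshtein(target, mt_of_source)
    longest = max(len(target), len(mt_of_source), 1)
    return {
        "levenshtein": float(edit),
        "levenshtein_norm": edit / longest,
        "cosine_distance": cosine_distance(source_vec, target_vec),
    }


if __name__ == "__main__":
    # Toy inputs: made-up strings and made-up 3-dimensional "embeddings".
    feats = pair_features("the cat sat", "the cat sits",
                          [0.2, 0.1, 0.7], [0.25, 0.05, 0.7])
    print(feats)
```

Features like these would then be passed, together with human-assigned alignment scores, to a regression model of the reader's choice; which model and which additional features the paper uses is not stated in the text shown here.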