Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers 2016
DOI: 10.18653/v1/w16-2371

Using Term Position Similarity and Language Modeling for Bilingual Document Alignment

Abstract: The WMT Bilingual Document Alignment Task requires systems to assign source pages to their "translations", in a big space of possible pairs. We present four methods: The first one uses the term position similarity between candidate document pairs. The second method requires automatically translated versions of the target text, and matches them with the candidates. The third and fourth methods try to overcome some of the challenges presented by the nature of the corpus, by considering the string similarity of s…
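Only the first of the four methods is sketched concretely in the abstract, so the snippet below is a minimal, hypothetical illustration of a position-aware term-overlap score for a candidate pair: documents are assumed to be tokenized word lists, and the function name, the relative-position weighting, and the max_offset threshold are illustrative choices, not the scoring actually used in the paper.

```python
def position_similarity(src_tokens, tgt_tokens, max_offset=0.1):
    """Score a candidate document pair by identical-term overlap,
    weighted by whether matching terms occur at similar relative
    positions. A rough sketch only; the paper's scoring is more involved."""

    def relative_positions(tokens):
        # Map each term to the relative position of its first occurrence.
        n = max(len(tokens), 1)
        positions = {}
        for i, tok in enumerate(tokens):
            positions.setdefault(tok.lower(), i / n)
        return positions

    src_pos = relative_positions(src_tokens)
    tgt_pos = relative_positions(tgt_tokens)
    shared = set(src_pos) & set(tgt_pos)
    if not shared:
        return 0.0

    # Count a shared term only if it sits at a similar relative position
    # in both documents (within max_offset of the document length).
    matched = sum(1 for t in shared
                  if abs(src_pos[t] - tgt_pos[t]) <= max_offset)
    return matched / max(len(src_pos), len(tgt_pos))
```

In an alignment setting such a score would be computed for every source/target candidate pair and the best-scoring target kept for each source page.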

Cited by 5 publications (3 citation statements)
References 10 publications

“…UFAL (Le et al., 2016) submitted 4 systems, each using a different method. UFAL-1 (81.3%) uses identical word matches by also considering their position in the text.…”
Section: Results (mentioning; confidence: 99%)
“…Examples of structural matching are the use of edit distance between linearized documents (Resnik and Smith, 2003) and the probability of a probabilistic DOM-tree alignment model (Shi et al., 2006). Using the URL for matching is a very powerful indicator for some domains, typically by using a predefined set of patterns for language marking or simple Levenshtein distance (Le et al., 2016).…”
Section: Document Alignment (mentioning; confidence: 99%)
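The URL signal described above can be pictured with a small sketch: strip a set of language-marking patterns from both URLs and compare what remains with Levenshtein distance. The marker list and helper names below are assumptions for illustration, not the pattern set used by Le et al. (2016).

```python
import re

# Hypothetical language markers; a real system would use a much larger
# predefined set of patterns.
LANG_MARKERS = re.compile(r"(/|[-_.])(en|fr|eng|fre|english|french)(?=/|[-_.]|$)",
                          re.IGNORECASE)

def strip_language_markers(url: str) -> str:
    """Remove language-marking segments so that, e.g., example.com/en/page
    and example.com/fr/page collapse to the same string."""
    return LANG_MARKERS.sub(r"\1", url)

def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def url_match_score(src_url: str, tgt_url: str) -> float:
    """1.0 means the URLs are identical after stripping language markers."""
    a, b = strip_language_markers(src_url), strip_language_markers(tgt_url)
    return 1.0 - levenshtein(a, b) / max(len(a), len(b), 1)
```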
“…However, most more recent work has focused on content similarity via bag-of-words or bag-of-n-grams, using a bilingual lexicon (Ma and Liberman, 1999; Fung and Cheung, 2004; Ion et al., 2011; Esplà-Gomis et al., 2016; Azpeitia and Etchegoyhen, 2019), machine translation (Uszkoreit et al., 2010), or phrase tables (Gomes and Pereira Lopes, 2016). Some work has considered high-level order as a filtering step after using an unordered representation to generate candidates: Ma and Liberman (1999) and Le et al. (2016) discard n-gram pairs outside a fixed window, while Uszkoreit et al. (2010) filter out documents that have high edit distance between sequences of corresponding n-gram pairs. Utiyama and Isahara (2003) and Zhang et al. (2006) use sentence similarity and/or the number of aligned sentences after performing sentence alignment to score candidate documents.…”
Section: Related Work (mentioning; confidence: 99%)
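The window-based filtering mentioned above varies between the cited papers; as a generic illustration, the sketch below collects n-gram pairs shared by two tokenized documents and discards pairs whose token positions differ by more than a fixed window. The n-gram order and window size are arbitrary assumptions.

```python
from collections import defaultdict

def ngram_positions(tokens, n=2):
    """Map each n-gram to the token positions where it starts."""
    index = defaultdict(list)
    for i in range(len(tokens) - n + 1):
        index[tuple(tokens[i:i + n])].append(i)
    return index

def shared_ngrams_within_window(src_tokens, tgt_tokens, n=2, window=50):
    """Return (ngram, src_pos, tgt_pos) triples for n-grams occurring in both
    documents, keeping only pairs whose positions are within `window` tokens
    of each other -- a simple stand-in for the fixed-window filtering step."""
    src_index = ngram_positions(src_tokens, n)
    tgt_index = ngram_positions(tgt_tokens, n)
    pairs = []
    for ngram, src_positions in src_index.items():
        for src_pos in src_positions:
            for tgt_pos in tgt_index.get(ngram, []):
                if abs(src_pos - tgt_pos) <= window:
                    pairs.append((ngram, src_pos, tgt_pos))
    return pairs
```

The number of surviving pairs (or a normalized version of it) can then serve as a content-similarity score for a candidate document pair.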