Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers 2016
DOI: 10.18653/v1/w16-2371

Using Term Position Similarity and Language Modeling for Bilingual Document Alignment

Abstract: The WMT Bilingual Document Alignment Task requires systems to assign source pages to their "translations", in a big space of possible pairs. We present four methods: The first one uses the term position similarity between candidate document pairs. The second method requires automatically translated versions of the target text, and matches them with the candidates. The third and fourth methods try to overcome some of the challenges presented by the nature of the corpus, by considering the string similarity of s…
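Only the first of the four methods is sketched concretely in the abstract, so the snippet below is a minimal, hypothetical illustration of a position-aware term-overlap score for a candidate pair: documents are assumed to be tokenized word lists, and the function name, the relative-position weighting, and the max_offset threshold are illustrative choices, not the scoring actually used in the paper.

```python
def position_similarity(src_tokens, tgt_tokens, max_offset=0.1):
    """Score a candidate document pair by identical-term overlap,
    weighted by whether matching terms occur at similar relative
    positions. A rough sketch only; the paper's scoring is more involved."""

    def relative_positions(tokens):
        # Map each term to the relative position of its first occurrence.
        n = max(len(tokens), 1)
        positions = {}
        for i, tok in enumerate(tokens):
            positions.setdefault(tok.lower(), i / n)
        return positions

    src_pos = relative_positions(src_tokens)
    tgt_pos = relative_positions(tgt_tokens)
    shared = set(src_pos) & set(tgt_pos)
    if not shared:
        return 0.0

    # Count a shared term only if it sits at a similar relative position
    # in both documents (within max_offset of the document length).
    matched = sum(1 for t in shared
                  if abs(src_pos[t] - tgt_pos[t]) <= max_offset)
    return matched / max(len(src_pos), len(tgt_pos))
```

In an alignment setting such a score would be computed for every source/target candidate pair and the best-scoring target kept for each source page.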

Cited by 5 publications (3 citation statements)
References 10 publications

“…UFAL (Le et al., 2016) submitted 4 systems, each using a different method. UFAL-1 (81.3%) uses identical word matches by also considering their position in the text.…”
Section: Results (mentioning; confidence: 99%)
“…Examples of structural matching are the use of edit distance between linearized documents (Resnik and Smith, 2003) and the probability of a probabilistic DOM-tree alignment model (Shi et al., 2006). Using the URL for matching is a very powerful indicator for some domains, typically by using a predefined set of patterns for language marking or simple Levenshtein distance (Le et al., 2016).…”
Section: Document Alignment (mentioning; confidence: 99%)
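The URL signal described above can be pictured with a small sketch: strip a set of language-marking patterns from both URLs and compare what remains with Levenshtein distance. The marker list and helper names below are assumptions for illustration, not the pattern set used by Le et al. (2016).

```python
import re

# Hypothetical language markers; a real system would use a much larger
# predefined set of patterns.
LANG_MARKERS = re.compile(r"(/|[-_.])(en|fr|eng|fre|english|french)(?=/|[-_.]|$)",
                          re.IGNORECASE)

def strip_language_markers(url: str) -> str:
    """Remove language-marking segments so that, e.g., example.com/en/page
    and example.com/fr/page collapse to the same string."""
    return LANG_MARKERS.sub(r"\1", url)

def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def url_match_score(src_url: str, tgt_url: str) -> float:
    """1.0 means the URLs are identical after stripping language markers."""
    a, b = strip_language_markers(src_url), strip_language_markers(tgt_url)
    return 1.0 - levenshtein(a, b) / max(len(a), len(b), 1)
```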
“…However, most more recent work has focused on content similarity via bag-of-words or bag-of-n-grams, using a bilingual lexicon (Ma and Liberman, 1999; Fung and Cheung, 2004; Ion et al., 2011; Esplà-Gomis et al., 2016; Azpeitia and Etchegoyhen, 2019), machine translation (Uszkoreit et al., 2010), or phrase tables (Gomes and Pereira Lopes, 2016). Some work has considered high-level order as a filtering step after using an unordered representation to generate candidates: Ma and Liberman (1999) and Le et al. (2016) discard n-gram pairs outside a fixed window, while Uszkoreit et al. (2010) filter out documents that have high edit distance between sequences of corresponding n-gram pairs. Utiyama and Isahara (2003) and Zhang et al. (2006) use sentence similarity and/or the number of aligned sentences after performing sentence alignment to score candidate documents.…”
Section: Related Work (mentioning; confidence: 99%)
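The window-based filtering mentioned above varies between the cited papers; as a generic illustration, the sketch below collects n-gram pairs shared by two tokenized documents and discards pairs whose token positions differ by more than a fixed window. The n-gram order and window size are arbitrary assumptions.

```python
from collections import defaultdict

def ngram_positions(tokens, n=2):
    """Map each n-gram to the token positions where it starts."""
    index = defaultdict(list)
    for i in range(len(tokens) - n + 1):
        index[tuple(tokens[i:i + n])].append(i)
    return index

def shared_ngrams_within_window(src_tokens, tgt_tokens, n=2, window=50):
    """Return (ngram, src_pos, tgt_pos) triples for n-grams occurring in both
    documents, keeping only pairs whose positions are within `window` tokens
    of each other -- a simple stand-in for the fixed-window filtering step."""
    src_index = ngram_positions(src_tokens, n)
    tgt_index = ngram_positions(tgt_tokens, n)
    pairs = []
    for ngram, src_positions in src_index.items():
        for src_pos in src_positions:
            for tgt_pos in tgt_index.get(ngram, []):
                if abs(src_pos - tgt_pos) <= window:
                    pairs.append((ngram, src_pos, tgt_pos))
    return pairs
```

The number of surviving pairs (or a normalized version of it) can then serve as a content-similarity score for a candidate document pair.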