2014
DOI: 10.1016/j.protcy.2014.11.024
Building Subject-aligned Comparable Corpora and Mining it for Truly Parallel Sentence Pairs

Abstract: Parallel sentences are a relatively scarce but extremely useful resource for many applications, including cross-lingual retrieval and statistical machine translation. This research presents our methodology for mining such data from previously obtained comparable corpora. The task is highly practical, since non-parallel multilingual data exist in far greater quantities than parallel corpora, but parallel sentences are a much more useful resource. Here we propose a web crawling method for building subject-aligned c…

Cited by 25 publications (13 citation statements); references 7 publications.
“…Other effects are smaller and statistically insignificant, suggesting that the particular choice of supplementary out-of-domain data may not matter as much as simply using a large amount. One notable exception is the parallel Wikipedia corpus (Wołk and Marasek, 2014), which exhibits a large negative trend on recall and F1, possibly due to its noisy, automatically-aligned provenance.…”
Section: External Training Corpora
confidence: 99%
“…OPUS contains more than 2.7 billion parallel sentences in 90 languages. The specific corpus we extracted consists of data from multiple domains and sources including: the ParaCrawl project (Esplà-Gomis et al, 2019), EUbookshop (Skadiņš et al, 2014), Tilde Model (Rozis and Skadinš, 2017), translation memories (DGT) (Steinberger et al, 2013), OpenSubtitles (Creutz, 2018), SciELO Parallel (Soares et al, 2018), JRC-Acquis Multilingual (Steinberger et al, 2006), Tanzil (Zarrabi-Zadeh, 2007), Europarl Parallel (Koehn, 2005), TED 2013 (Cettolo et al, 2012), Wikipedia (Wołk and Marasek, 2014), Tatoeba, QCRI Educational Domain (Abdelali et al, 2014), GNOME localization files, Global Voices, KDE4, Ubuntu, and Multilingual Bible (Christodouloupoulos and Steedman, 2015).…”
Section: OPUS Data
confidence: 99%
“…Since the LFAligner Italian-English dictionary was rather small (around 14,500 terms) and we did not find other accurate, manually annotated, freely available English-Italian lexicons, we investigated whether a large automatically created lexicon could be useful. We compiled a large English-Italian corpus (containing 3,131,200 parallel sentences) by concatenating the Europarl (Koehn, 2005), Wikipedia (Wołk and Marasek, 2014), GlobalVoices, and books corpora from OPUS (Tiedemann, 2012). We used Giza++ (Och and Ney, 2003) to align the corpus, then used Moses SMT (Koehn et al, 2007) to symmetrize the directional alignments and extract a lexical translation table.…”
Section: Hunalign with LFAligner Dictionary
confidence: 99%
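The last citing statement describes a common pipeline: word-align a parallel corpus, symmetrize the alignments, then estimate a lexical translation table from the aligned word pairs. As a minimal sketch of that final estimation step (not the actual Giza++/Moses implementation; the function name and the toy English-Italian data are hypothetical), the table can be built by relative-frequency counting over alignment links:

```python
from collections import defaultdict

def lexical_translation_table(aligned_pairs):
    """Estimate p(target word | source word) from word-aligned sentence pairs.

    aligned_pairs: iterable of (src_tokens, tgt_tokens, alignment), where
    alignment is a set of (src_index, tgt_index) links, e.g. the symmetrized
    output of a word aligner such as Giza++.
    """
    pair_counts = defaultdict(float)   # counts of (source, target) links
    src_counts = defaultdict(float)    # counts of source words in any link
    for src, tgt, alignment in aligned_pairs:
        for i, j in alignment:
            pair_counts[(src[i], tgt[j])] += 1.0
            src_counts[src[i]] += 1.0
    # Normalize link counts by source-word totals to get p(t | s).
    return {(s, t): c / src_counts[s] for (s, t), c in pair_counts.items()}

# Toy English-Italian example with hand-made alignments:
pairs = [
    (["the", "house"], ["la", "casa"], {(0, 0), (1, 1)}),
    (["the", "book"], ["il", "libro"], {(0, 0), (1, 1)}),
]
table = lexical_translation_table(pairs)
# table[("house", "casa")] == 1.0, table[("the", "la")] == 0.5
```

In the Moses toolkit the analogous tables (lex.e2f / lex.f2e) are produced during phrase-table training; the sketch above only illustrates the counting principle.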