2014
DOI: 10.1016/j.protcy.2014.11.024
Building Subject-aligned Comparable Corpora and Mining it for Truly Parallel Sentence Pairs

Abstract: Parallel sentences are a relatively scarce but extremely useful resource for many applications, including cross-lingual retrieval and statistical machine translation. This research presents our methodology for mining such data from previously obtained comparable corpora. The task is highly practical, since non-parallel multilingual data exist in far greater quantities than parallel corpora, but parallel sentences are a much more useful resource. Here we propose a web crawling method for building subject-aligned c…

Cited by 25 publications (13 citation statements); references 7 publications.
“…Other effects are smaller and statistically insignificant, suggesting that the particular choice of supplementary out-of-domain data may not matter as much as simply using a large amount. One notable exception is the parallel Wikipedia corpus (Wołk and Marasek, 2014), which exhibits a large negative trend on recall and F1, possibly due to its noisy, automatically-aligned provenance.…”
Section: External Training Corpora
confidence: 99%
“…OPUS contains more than 2.7 billion parallel sentences in 90 languages. The specific corpus we extracted consists of data from multiple domains and sources including: the ParaCrawl project (Esplà-Gomis et al, 2019), EUbookshop (Skadiņš et al, 2014), Tilde Model (Rozis and Skadinš, 2017), translation memories (DGT) (Steinberger et al, 2013), OpenSubtitles (Creutz, 2018), SciELO Parallel (Soares et al, 2018), JRC-Acquis Multilingual (Steinberger et al, 2006), Tanzil (Zarrabi-Zadeh, 2007), Europarl Parallel (Koehn, 2005), TED 2013 (Cettolo et al, 2012), Wikipedia (Wołk and Marasek, 2014), Tatoeba, QCRI Educational Domain (Abdelali et al, 2014), GNOME localization files, Global Voices, KDE4, Ubuntu, and Multilingual Bible (Christodouloupoulos and Steedman, 2015).…”
Section: OPUS Data
confidence: 99%
“…Since the LFAligner Italian-English dictionary was rather small (around 14,500 terms) and we did not find other accurate, manually annotated, freely available English-Italian lexicons, we investigated whether a large automatically created lexicon could be useful. We compiled a large English-Italian corpus (containing 3,131,200 parallel sentences) by concatenating the Europarl (Koehn, 2005), Wikipedia (Wołk and Marasek, 2014), GlobalVoices, and books corpora from OPUS (Tiedemann, 2012). We used Giza++ (Och and Ney, 2003) to align the corpus, then used Moses SMT (Koehn et al, 2007) to symmetrize the directional alignments and extract a lexical translation table.…”
Section: Hunalign with LFAligner Dictionary
confidence: 99%
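The last citing statement describes a common pipeline: word-align a parallel corpus, symmetrize the alignments, then estimate a lexical translation table from the aligned word pairs. As a minimal sketch of that final estimation step (not the actual Giza++/Moses implementation; the function name and the toy English-Italian data are hypothetical), the table can be built by relative-frequency counting over alignment links:

```python
from collections import defaultdict

def lexical_translation_table(aligned_pairs):
    """Estimate p(target word | source word) from word-aligned sentence pairs.

    aligned_pairs: iterable of (src_tokens, tgt_tokens, alignment), where
    alignment is a set of (src_index, tgt_index) links, e.g. the symmetrized
    output of a word aligner such as Giza++.
    """
    pair_counts = defaultdict(float)   # counts of (source, target) links
    src_counts = defaultdict(float)    # counts of source words in any link
    for src, tgt, alignment in aligned_pairs:
        for i, j in alignment:
            pair_counts[(src[i], tgt[j])] += 1.0
            src_counts[src[i]] += 1.0
    # Normalize link counts by source-word totals to get p(t | s).
    return {(s, t): c / src_counts[s] for (s, t), c in pair_counts.items()}

# Toy English-Italian example with hand-made alignments:
pairs = [
    (["the", "house"], ["la", "casa"], {(0, 0), (1, 1)}),
    (["the", "book"], ["il", "libro"], {(0, 0), (1, 1)}),
]
table = lexical_translation_table(pairs)
# table[("house", "casa")] == 1.0, table[("the", "la")] == 0.5
```

In the Moses toolkit the analogous tables (lex.e2f / lex.f2e) are produced during phrase-table training; the sketch above only illustrates the counting principle.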