2015
DOI: 10.1145/2833089

Integrated Parallel Sentence and Fragment Extraction from Comparable Corpora

Abstract: Parallel corpora are crucial for statistical machine translation (SMT); however, they are quite scarce for most language pairs and domains. As comparable corpora are far more available, many studies have been conducted to extract either parallel sentences or fragments from them for SMT. In this article, we propose an integrated system to extract both parallel sentences and fragments from comparable corpora. We first apply parallel sentence extraction to identify parallel sentences from comparable sentences. We…

Cited by 18 publications (14 citation statements)
References 25 publications
“…We collected all the publicly available, parallel Chinese-Japanese corpora we could find, and made it available to participants as the existing parallel. These include Global Voices, News Commentary, and Ubuntu corpora from OPUS Tiedemann (2012); OpenSubtitles (Lison and Tiedemann, 2016); TED talks (Dabre and Kurohashi, 2017); Wikipedia (Chu et al, 2014, 2015); Wiktionary.org; and WikiMatrix (Schwenk et al, 2019). We also collected parallel sentences from Tatoeba.org, released under a CC-BY License.…”
Section: Parallel Training Data (mentioning)
confidence: 99%
“…They attempt to find the translations in tweets instead of translating the texts. [8] extract both parallel sentences and fragments from comparable corpora of Chinese-Japanese Wikipedia to improve statistical MT. [12] apply a domain-biased parallel data collection and a structured methodology to obtain English-Hindi parallel data.…”
Section: Related Work (mentioning)
confidence: 99%
“…In recent years, there have been several approaches developed for obtaining parallel sentences or fragments from non-parallel data [8], [9], such as comparable data [8], [10], [11], [12] and quasi-comparable data [13] to make contributions to SMT. Parallel corpora contain parallel sentences, i.e., sentences which are translations of each other.…”
Section: Related Work (mentioning)
confidence: 99%