2017
DOI: 10.48550/arxiv.1710.02855
Preprint

The IIT Bombay English-Hindi Parallel Corpus

Abstract: We present the IIT Bombay English-Hindi Parallel Corpus. The corpus is a compilation of parallel corpora previously available in the public domain as well as new parallel corpora we collected. The corpus contains 1.49 million parallel segments, of which 694k segments were not previously available in the public domain. The corpus has been pre-processed for machine translation, and we report baseline phrase-based SMT and NMT translation results on this corpus. This corpus has been used in two editions of shared …

Cited by 13 publications (13 citation statements)
References 3 publications (4 reference statements)
“…For the monolingual corpus, we extracted it from CC-100 according to ; . For bilingual corpus, we use the same corpus as INFOXLM (Chi et al, 2020b), including MultiUN (Ziemski et al, 2016), IIT Bombay (Kunchukuttan et al, 2017), OPUS (Tiedemann, 2012) and WikiMatrix . In order to balance the size of data for high-resource and low-resource languages, both monolingual and parallel corpora were sampled with parameter alpha of 0.1 referred to the method of (Lample and Conneau, 2019).…”
Section: Data and Model (mentioning)
confidence: 99%
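The alpha-based sampling mentioned in the statement above can be illustrated with a minimal sketch. It assumes the exponentiated multinomial form q_i = p_i^alpha / sum_j p_j^alpha, with p_i the share of data in language i, which is the form commonly attributed to Lample and Conneau (2019). The function name and the corpus sizes are hypothetical and are not taken from the cited paper.

```python
def smoothed_sampling_probs(sizes, alpha):
    """Return q_i = p_i**alpha / sum_j p_j**alpha, where p_i = n_i / sum_k n_k.

    Assumption: this is the exponent-smoothed multinomial; alpha < 1 shifts
    probability mass toward low-resource languages.
    """
    total = sum(sizes.values())
    weights = {lang: (n / total) ** alpha for lang, n in sizes.items()}
    norm = sum(weights.values())
    return {lang: w / norm for lang, w in weights.items()}


# Hypothetical sentence counts, for illustration only.
sizes = {"en": 1_000_000, "hi": 10_000}
probs = smoothed_sampling_probs(sizes, alpha=0.1)
```

With alpha = 1 the function returns the raw data proportions; as alpha approaches 0 the distribution approaches uniform, so a small value such as 0.1 samples low-resource languages far more often than their raw share would suggest.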
“…In our English to Hindi machine translation experiments, we have used the publicly available IIT Bombay (IITB) English-Hindi Parallel Corpus (Kunchukuttan et al, 2017). The training data in the IITB corpus consists of nearly 1.5M training samples.…”
Section: Dataset Details (mentioning)
confidence: 99%
“…from IITB (Kunchukuttan et al, 2017); English-Russian, English-Arabic, and English-Chinese parallel data from the UN Corpus (Ziemski et al, 2016); English-Tamil and English-Telugu from Wikimatrix (Schwenk et al, 2019). We report the counts in Table 1.…”
Section: Datasets and Preprocessing (mentioning)
confidence: 99%