Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing 2021
DOI: 10.18653/v1/2021.emnlp-main.369

PhoMT: A High-Quality and Large-Scale Benchmark Dataset for Vietnamese-English Machine Translation

Abstract: We introduce a high-quality and large-scale Vietnamese-English parallel dataset of 3.02M sentence pairs, which is 2.9M pairs larger than the benchmark Vietnamese-English machine translation corpus IWSLT15. We conduct experiments comparing strong neural baselines and well-known automatic translation engines on our dataset and find that, in both automatic and human evaluations, the best performance is obtained by fine-tuning the pretrained sequence-to-sequence denoising autoencoder mBART. To our best knowledge, t…

Cited by 4 publications (8 citation statements)
References 22 publications
“…https://github.com/jiaaro/pydub 3 We use the English-to-Vietnamese translation direction because Google-translating from English to Vietnamese produces better translation than from Vietnamese to English. This is confirmed via BLEU scores in the first two rows in Tables 3 and 4 from [23] or human evaluation results for Google Translate in Table 2 from [22] 4. For checking the timestamp misalignment, we reuse our PyDub-based tool developed for correcting the timestamps of the first and/or last words of sentences in the third phase (Section 2.3).…”
mentioning
confidence: 77%
“…To align parallel sentences within a parallel English-Vietnamese document pair, following [22], we first use Google Translate to translate English source sentences into Vietnamese. 3 Then, to produce parallel English-Vietnamese sentence pairs, we use three alignment toolkits of Hunalign [24], Gargantua [25] and Bleualign [26] to perform an intermediate alignment between the Vietnamese Google-translated versions of the English source sentences and the Vietnamese target sentences.…”
Section: Aligning Parallel English-Vietnamese Sentence Pairs
mentioning
confidence: 99%
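The intermediate alignment step quoted above (matching machine-translated source sentences against target sentences in the same language) can be illustrated with a minimal sketch. It is not Hunalign, Gargantua, or Bleualign, and it does not call Google Translate: it assumes the source sentences have already been translated, uses a simple token-overlap score from Python's `difflib` as a stand-in similarity, and finds a monotonic one-to-one alignment by dynamic programming. The function names, the `threshold` value, and the example sentences are all hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Token-level overlap ratio in [0, 1] (a stand-in for a real alignment score)."""
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()


def align(translated_src: list[str], tgt: list[str], threshold: float = 0.5):
    """Monotonic 1-1 alignment maximizing total similarity.

    Returns (source_index, target_index) pairs; sentences whose best
    similarity falls below `threshold` are left unaligned.
    """
    n, m = len(translated_src), len(tgt)
    # score[i][j]: best total similarity using the first i source
    # and first j target sentences.
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[""] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sim = similarity(translated_src[i - 1], tgt[j - 1])
            # A pair below the threshold is never allowed to match.
            match = score[i - 1][j - 1] + sim if sim >= threshold else -1.0
            best = max(match, score[i - 1][j], score[i][j - 1])
            score[i][j] = best
            if best == match:
                back[i][j] = "match"
            elif best == score[i - 1][j]:
                back[i][j] = "skip_src"
            else:
                back[i][j] = "skip_tgt"
    # Trace back from the bottom-right corner to recover the pairs.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        if back[i][j] == "match":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif back[i][j] == "skip_src":
            i -= 1
        else:
            j -= 1
    return pairs[::-1]


# Hypothetical data: Vietnamese translations of English source sentences,
# and the Vietnamese target sentences from the same document.
translated = ["tôi thích cà phê", "trời hôm nay đẹp", "cảm ơn bạn"]
targets = ["tôi rất thích cà phê", "quảng cáo", "cảm ơn bạn rất nhiều"]
pairs = align(translated, targets)
```

Each index pair returned by `align` would then be mapped back to the original English source sentence to form an English-Vietnamese sentence pair. Real toolkits also handle one-to-many alignments and use stronger scores, which this sketch omits.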
“…One of the first notable parallel datasets for En-Vi neural machine translation is IWSLT’15 (Luong and Manning, 2015) with 133K sentence pairs. A few years later, PhoMT (Doan et al., 2021) and VLSP2020 (Ha et al., 2020) released larger parallel datasets, extracted from publicly available resources for English-Vietnamese translation.…”
Section: English-Vietnamese Translation
mentioning
confidence: 99%