2021
DOI: 10.48550/arxiv.2111.10513
Preprint

Data Processing Matters: SRPH-Konvergen AI's Machine Translation System for WMT'21

Abstract: In this paper, we describe the submission of the joint Samsung Research Philippines-Konvergen AI team for the WMT'21 Large Scale Multilingual Translation Task - Small Track 2. We submit a standard Seq2Seq Transformer model to the shared task without any training or architecture tricks, relying mainly on the strength of our data preprocessing techniques to boost performance. Our final submission model scored 22.92 average BLEU on the FLORES-101 devtest set, and scored 22.97 average BLEU on the contest's hidden t…

Citations: cited by 1 publication (1 citation statement)
References: 9 publications (9 reference statements)
“…We employ ratio-based filters on tokenized sentence pairs following Cruz and Sutawika (2022) and Sutawika and Cruz (2021). We first tokenize using SacreMoses, then apply the following ratio-based filters: …”

Table 1: Corpus Statistics (fragment carried into the excerpt; column headers and the first row label are not recoverable)
(label missing): 7,143,725; 115,239,312; 95,954,020
Synthetic he→en: 73,278,018; 1,471,827,973; 1,056,677,671
Synthetic he→en Filtered: 47,372,416; 659,409,236; 541,376,459
Section: Ratio-based (mentioning)
confidence: 99%
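The citation statement above refers to ratio-based filters on tokenized sentence pairs, but the excerpt does not list the filters or their thresholds. As a rough illustration only, one common filter of this kind drops a pair when one side has far more tokens than the other. In the sketch below, the function names, the `max_ratio` value of 2.0, and the use of whitespace splitting in place of SacreMoses tokenization are all assumptions made for the example, not details taken from either paper.

```python
def token_length_ratio(src_tokens, tgt_tokens):
    """Ratio of the longer token list to the shorter one (always >= 1.0).

    Returns infinity when either side is empty, so such pairs are
    rejected by any finite threshold.
    """
    longer = max(len(src_tokens), len(tgt_tokens))
    shorter = min(len(src_tokens), len(tgt_tokens))
    return float("inf") if shorter == 0 else longer / shorter


def filter_pairs(pairs, max_ratio=2.0, tokenize=str.split):
    """Keep (source, target) pairs whose token length ratio is <= max_ratio.

    `tokenize` defaults to whitespace splitting (an assumption for this
    sketch); any callable mapping a string to a list of tokens can be
    passed instead.
    """
    kept = []
    for src, tgt in pairs:
        if token_length_ratio(tokenize(src), tokenize(tgt)) <= max_ratio:
            kept.append((src, tgt))
    return kept


if __name__ == "__main__":
    sample = [
        ("a small house", "una casa pequeña"),    # balanced lengths
        ("one two three four five six", "uno"),   # badly mismatched lengths
        ("", "orphan target"),                    # empty source side
    ]
    for src, tgt in filter_pairs(sample):
        print(src, "|||", tgt)
```

A real tokenizer could be supplied through the `tokenize` parameter; the threshold would need tuning per language pair, since typical length ratios differ between languages.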