“…We employ ratio-based filters on tokenized sentence pairs following Cruz and Sutawika (2022) and Sutawika and Cruz (2021). We first tokenize using SacreMoses, then apply the following ratio-based filters:

                              Sentence Pairs   Source Tokens    Target Tokens
    …                         7,143,725        115,239,312      95,954,020
    Synthetic he→en           73,278,018       1,471,827,973    1,056,677,671
    Synthetic he→en Filtered  47,372,416       659,409,236      541,376,459

Table 1: Corpus Statistics.…”
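The filtering step described above can be sketched as follows. This is a minimal illustration of a length-ratio filter on pre-tokenized sentence pairs; the excerpt does not list the actual filters or thresholds used, so the ratio bound `MAX_RATIO` below is a hypothetical placeholder, not the authors' value.

```python
# Sketch of a ratio-based filter over tokenized sentence pairs.
# MAX_RATIO is a hypothetical threshold for illustration only.

MAX_RATIO = 2.0  # hypothetical bound on source/target token-length ratio


def length_ratio_ok(src_tokens, tgt_tokens, max_ratio=MAX_RATIO):
    """Keep a pair only if neither side is more than max_ratio times longer."""
    if not src_tokens or not tgt_tokens:
        return False  # drop empty sides outright
    ratio = len(src_tokens) / len(tgt_tokens)
    return 1.0 / max_ratio <= ratio <= max_ratio


def filter_corpus(pairs):
    """Yield only the (src, tgt) token-list pairs that pass the ratio filter."""
    for src, tgt in pairs:
        if length_ratio_ok(src, tgt):
            yield src, tgt


pairs = [
    (["hello", "world"], ["shalom", "olam"]),          # ratio 1.0 -> kept
    (["a"], ["one", "two", "three", "four", "five"]),  # ratio 0.2 -> dropped
]
kept = list(filter_corpus(pairs))
```

In practice such a filter runs after tokenization (e.g. with SacreMoses) and before any deduplication, which is consistent with the reduction from the unfiltered to the filtered synthetic corpus in Table 1.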