“…In the Transformer architecture, we use $N_{encoder} = 3$, $N_{decoder} = 3$, $d_{model} = 512$, and 8 attention heads. During inference, we use three values $max\_length \in \{20, 22, 23\}$, apply beam search with $beam\_size \in \{3, 4, 5\}$, and use 50 submissions in the public-test round for evaluation. For the private test, we use three values $max\_length \in \{22, 23, 24\}$.…”
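A minimal sketch of this setup, assuming PyTorch; the decoding routine is a hypothetical placeholder, not the authors' code, and the private-test beam sizes are an assumption since the excerpt only states them for the public round.

```python
from itertools import product

import torch.nn as nn

# Architecture from the excerpt: 3 encoder layers, 3 decoder layers,
# d_model = 512, and 8 attention heads.
model = nn.Transformer(
    d_model=512,
    nhead=8,
    num_encoder_layers=3,
    num_decoder_layers=3,
)

# Public-test inference grid: three max lengths crossed with three beam sizes.
public_grid = list(product([20, 22, 23], [3, 4, 5]))
# Private-test max lengths per the excerpt; beam sizes are not stated there,
# so reusing the public ones here is an assumption.
private_grid = list(product([22, 23, 24], [3, 4, 5]))

for max_length, beam_size in public_grid:
    # beam_search_decode is a hypothetical stand-in for the actual decoder call.
    # outputs = beam_search_decode(model, src, max_length=max_length,
    #                              beam_size=beam_size)
    print(f"decoding with max_length={max_length}, beam_size={beam_size}")
```

Enumerating the grid this way makes the sweep explicit: nine public-test configurations, each of which would produce one candidate submission toward the 50 allowed.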