End-to-end ST. Since the first proof-of-concept works (Bérard et al., 2016; Duong et al., 2016), solving speech translation in an end-to-end manner has attracted extensive attention (Vila et al., 2018; Salesky et al., 2018, 2019; Di Gangi et al., 2019b; Bahar et al., 2019a; Di Gangi et al., 2019c; Inaguma et al., 2020). Standard training techniques such as pretraining (Weiss et al., 2017; Bérard et al., 2018; Bansal et al., 2018; Stoian et al., 2020; Wang et al., 2020a), multi-task training (Vydana et al., 2021; Le et al., 2020; Tang et al., 2021), meta-learning (Indurthi et al., 2020), and curriculum learning (Kano et al., 2017; Wang et al., 2020b) have been applied. As ST data are expensive to collect, Jia et al. (2019), Pino et al. (2019), and Bahar et al. (2019b) augment the training data with synthetic examples generated from ASR and MT corpora.