“…To avoid this problem and improve efficiency, end-to-end ST models have been proposed and have become popular in recent years (Berard et al., 2016, 2018; Bansal et al., 2018). To alleviate the data scarcity problem of end-to-end ST models, various techniques have been utilized, including pre-training (Bansal et al., 2019), multi-task learning (Anastasopoulos and Chiang, 2018), knowledge distillation (Ren et al., 2020), data synthesis (Jia et al., 2019), self-supervised learning, and speech augmentation techniques such as SpecAugment (Bahar et al., 2019) or speed perturbation (Stoian et al., 2020). Other studies focus on bridging the gap between different modalities (speech and text) or different modules (acoustic and semantic modeling).…”
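As a concrete illustration of the speech augmentation idea mentioned above, the following is a minimal SpecAugment-style sketch: random frequency and time masks are zeroed out on a log-mel spectrogram. The function name and mask widths are illustrative choices, not taken from any of the cited papers.

```python
import numpy as np

def spec_augment(spec, num_freq_masks=2, freq_mask_width=8,
                 num_time_masks=2, time_mask_width=20, rng=None):
    """SpecAugment-style masking (illustrative sketch).

    spec: array of shape (num_freq_bins, num_frames), e.g. a log-mel
    spectrogram. Randomly chosen frequency bands and time spans are
    zeroed, forcing the model to rely on the remaining context.
    """
    rng = rng if rng is not None else np.random.default_rng()
    out = spec.copy()
    n_freq, n_time = out.shape
    # Mask a few random frequency bands (rows).
    for _ in range(num_freq_masks):
        w = int(rng.integers(0, freq_mask_width + 1))
        f0 = int(rng.integers(0, max(1, n_freq - w + 1)))
        out[f0:f0 + w, :] = 0.0
    # Mask a few random time spans (columns).
    for _ in range(num_time_masks):
        w = int(rng.integers(0, time_mask_width + 1))
        t0 = int(rng.integers(0, max(1, n_time - w + 1)))
        out[:, t0:t0 + w] = 0.0
    return out

augmented = spec_augment(np.ones((80, 100)), rng=np.random.default_rng(0))
```

Applied on the fly during training, such masking acts as a cheap regularizer: no extra audio is needed, which is why it is attractive under the data scarcity conditions the excerpt describes.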