…Baseline Models. In Table 1, we compare our method with end-to-end baseline models whose audio inputs are 80-channel log Mel-filterbank features, including: Fairseq ST (Wang et al., 2020a), NeurST (Zhao et al., 2021a), ESPnet-ST (Inaguma et al., 2020), Dual-decoder Transformer (Le et al., 2020), SATE, Speechformer (Papi et al., 2021), the self-training and mutual-learning method (Zhao et al., 2021b), STAST, bi-KD (Inaguma et al., 2021), the MLT method (Tang et al., 2021b), Lightweight Adaptor (Le et al., 2021), JT-S-MT (Tang et al., 2021a), FAT-ST, TaskAware (Indurthi et al., 2021), and STPT (Tang et al., 2022). We also compare our method with baseline models that use pretrained Wav2vec2.0 as a module, including: …