End-to-end ST. Since the first proof-of-concept works (Bérard et al., 2016; Duong et al., 2016), solving speech translation in an end-to-end manner has attracted extensive attention (Vila et al., 2018; Salesky et al., 2018, 2019; Di Gangi et al., 2019b; Bahar et al., 2019a; Di Gangi et al., 2019c; Inaguma et al., 2020). Standard training techniques such as pretraining (Weiss et al., 2017; Bérard et al., 2018; Bansal et al., 2018; Stoian et al., 2020; Wang et al., 2020a), multi-task training (Vydana et al., 2021; Le et al., 2020; Tang et al., 2021), meta-learning (Indurthi et al., 2020), and curriculum learning (Kano et al., 2017; Wang et al., 2020b) have been applied. As ST data are expensive to collect, Jia et al. (2019), Pino et al. (2019), and Bahar et al. (2019b) augment the training data with synthetic examples generated from ASR and MT corpora.