ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp39728.2021.9415058

A General Multi-Task Learning Framework to Leverage Text Data for Speech to Text Tasks

Abstract: Attention-based sequence-to-sequence modeling provides a powerful and elegant solution for applications that need to map one sequence to a different sequence. Its success heavily relies on the availability of large amounts of training data. This presents a challenge for speech applications where labelled speech data is very expensive to obtain, such as automatic speech recognition (ASR) and speech translation (ST). In this study, we propose a general multi-task learning framework to leverage text data for ASR …

Cited by 46 publications (31 citation statements)
References 21 publications (30 reference statements)
“…The SpecAugment (Park et al, 2019) data augmentation with the LB policy is applied in all experiments. The input text tokens are converted into their corresponding pronunciation form as phoneme sequences (Tang et al, 2021; Renduchintala et al, 2018). The grapheme-to-phoneme conversion is done through the "g2p_en" Python package (Lee and Kim, 2018).…”
Section: Methods
confidence: 99%
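The masking side of SpecAugment referenced in the excerpt above can be sketched in a few lines of NumPy. This is a simplified illustration only (time warping is omitted); the `spec_augment` helper and its default mask widths are assumptions chosen to resemble the LB policy from Park et al. (2019), not the cited authors' implementation.

```python
import numpy as np

def spec_augment(spec, num_freq_masks=1, freq_width=27,
                 num_time_masks=1, time_width=100, rng=None):
    """Apply SpecAugment-style frequency and time masking to a
    log-mel spectrogram of shape (time, freq). Returns a masked copy."""
    rng = np.random.default_rng(rng)
    spec = spec.copy()
    num_frames, num_bins = spec.shape
    # Frequency masking: zero out a random band of consecutive mel bins.
    for _ in range(num_freq_masks):
        f = int(rng.integers(0, freq_width + 1))
        f0 = int(rng.integers(0, max(1, num_bins - f + 1)))
        spec[:, f0:f0 + f] = 0.0
    # Time masking: zero out a random span of consecutive frames.
    for _ in range(num_time_masks):
        t = int(rng.integers(0, min(time_width, num_frames) + 1))
        t0 = int(rng.integers(0, max(1, num_frames - t + 1)))
        spec[t0:t0 + t, :] = 0.0
    return spec
```

In practice such masking is applied on the fly during training, so each epoch sees differently corrupted views of the same utterance.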
“…Baseline Systems We compare our method with several strong end-to-end ST systems including: Fairseq ST (Wang et al, 2020a), AFS, DDT (Le et al, 2020), MTL (Tang et al, 2021b), Self-training, BiKD (Inaguma et al, 2021a), FAT-ST (Zheng et al, 2021a), JT-S-MT (Tang et al, 2021a), SATE, Chimera (Han et al, 2021) and XSTNet (Ye et al, 2021). Besides, we implement a strong baseline W2V2-Transformer based on Wav2vec2.0.…”
Section: Mixup Ratio Strategy
confidence: 99%
“…End-to-end ST Since its first proof-of-concept work (Bérard et al, 2016; Duong et al, 2016), solving Speech Translation in an end-to-end manner has attracted extensive attention (Vila et al, 2018; Salesky et al, 2018, 2019; Di Gangi et al, 2019b; Bahar et al, 2019a; Di Gangi et al, 2019c; Inaguma et al, 2020). Standard training techniques such as pretraining (Weiss et al, 2017; Bérard et al, 2018; Bansal et al, 2018; Stoian et al, 2020; Wang et al, 2020a), multi-task training (Vydana et al, 2021; Le et al, 2020; Tang et al, 2021), meta-learning (Indurthi et al, 2020), and curriculum learning (Kano et al, 2017; Wang et al, 2020b) have been applied. As ST data are expensive to collect, Jia et al (2019), Pino et al (2019), and Bahar et al (2019b) augment synthesized data from ASR and MT corpora.…”
Section: Related Work
confidence: 99%