ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9054130
Improving Sequence-To-Sequence Speech Recognition Training with On-The-Fly Data Augmentation

Abstract: Sequence-to-Sequence (S2S) models recently started to show state-of-the-art performance for automatic speech recognition (ASR). With these large and deep models, overfitting remains the largest problem, outweighing performance improvements that can be obtained from better architectures. One solution to the overfitting problem is increasing the amount of available training data and the variety exhibited by the training data with the help of data augmentation. In this paper we examine the influence of three data …

Cited by 74 publications (58 citation statements)
References 16 publications
“…FBK (Gaido et al, 2020) participated with an end-to-end system adapting the S-Transformer model (Di Gangi et al, 2019b,c). Its training is based on: i) transfer learning (via ASR pretraining and word/sequence knowledge distillation), ii) data augmentation (with SpecAugment (Park et al, 2019), time stretch (Nguyen et al, 2020a) and synthetically-created data), iii) combining synthetic and real data marked as different "domains" as in (Di Gangi et al, 2019d), and iv) multitask learning using the CTC loss (Graves et al, 2006). Once the training with word-level knowledge distillation is complete, the model is fine-tuned using label-smoothed cross entropy (Szegedy et al, 2016).…”
Section: Submissions
confidence: 99%
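The time-stretch augmentation cited above (Nguyen et al, 2020a) perturbs the apparent speaking rate in the feature domain. As a rough illustration only (not the authors' implementation), a spectrogram can be stretched or compressed along the time axis by a random factor using nearest-frame resampling; the function name and the factor range `[0.8, 1.25]` are illustrative assumptions:

```python
import numpy as np

def time_stretch(spec, low=0.8, high=1.25, rng=None):
    """Stretch/compress a (time_frames, freq_bins) spectrogram along the
    time axis by a random factor, via nearest-frame resampling.

    A simple approximation of feature-domain time stretching; the factor
    range is an assumed default, not taken from the cited paper.
    """
    rng = np.random.default_rng(rng)
    factor = rng.uniform(low, high)          # >1 stretches, <1 compresses
    n_frames = spec.shape[0]
    new_len = max(1, int(round(n_frames * factor)))
    # Map each output frame back to its nearest source frame.
    idx = np.minimum((np.arange(new_len) / factor).astype(int), n_frames - 1)
    return spec[idx]
```

Because resampling only duplicates or drops frames, every output frame is a copy of some input frame, so the frequency axis and feature values are untouched.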
“…(1) ASR (both LSTM (Nguyen et al, 2020b) and Transformer-based (Pham et al, 2019a)), (2) Segmentation (with a monolingual NMT system (Sperber et al, 2018) that adds sentence boundaries and case, also inserting proper punctuation), and (3) MT (a Transformer-based encoder-decoder model implementing Relative Attention following (Dai et al, 2019), adapted via fine-tuning on data incorporating artificially-injected noise). The WerRTCVAD toolkit is used to process the unsegmented test set.…”
Section: Submissions
confidence: 99%
“…Model We only focus on sequence-to-sequence ASR models, which are based on two different network architectures: the long short-term memory (LSTM) and the Transformer. Our LSTM-based models consist of 6 bidirectional layers of 1024 units for the encoder and 2 unidirectional layers for the decoder (Nguyen et al, 2019). Our Transformer-based models, presented in (Pham et al, 2019b), consist of 32 blocks for the encoder and 12 blocks for the decoder.…”
Section: Speech Recognition
confidence: 99%
“…To this end, we rely on data augmentation and knowledge transfer techniques that were shown to yield competitive models at the IWSLT-2020 evaluation campaign (Ansari et al, 2020; Potapczyk and Przybysz, 2020; Gaido et al, 2020). In particular, we use three data augmentation methods, namely SpecAugment (Park et al, 2019), time stretch (Nguyen et al, 2020), and synthetic data generation (Jia et al, 2019), and we transfer knowledge both from ASR and MT through component initialization and knowledge distillation (Hinton et al, 2015).…”
Section: Base ST Model
confidence: 99%
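SpecAugment (Park et al, 2019), referenced throughout these citation statements, zeroes out random frequency bands and time spans of the log-mel spectrogram during training. A minimal NumPy sketch of that masking idea follows; the mask counts and maximum widths are illustrative defaults, not the exact policy from the paper:

```python
import numpy as np

def spec_augment(spec, num_freq_masks=2, freq_mask_width=27,
                 num_time_masks=2, time_mask_width=100, rng=None):
    """SpecAugment-style masking on a (time_frames, freq_bins) spectrogram.

    Returns a masked copy; widths are drawn uniformly in [0, max_width].
    Parameter defaults are assumed for illustration.
    """
    rng = np.random.default_rng(rng)
    out = spec.copy()
    n_frames, n_bins = out.shape
    # Frequency masking: zero random contiguous bands of mel bins.
    for _ in range(num_freq_masks):
        w = int(rng.integers(0, min(freq_mask_width, n_bins) + 1))
        f0 = int(rng.integers(0, max(1, n_bins - w + 1)))
        out[:, f0:f0 + w] = 0.0
    # Time masking: zero random contiguous spans of frames.
    for _ in range(num_time_masks):
        w = int(rng.integers(0, min(time_mask_width, n_frames) + 1))
        t0 = int(rng.integers(0, max(1, n_frames - w + 1)))
        out[t0:t0 + w, :] = 0.0
    return out
```

Because the masking is applied on the fly to each mini-batch, the model effectively never sees the same spectrogram twice, which is what makes it an inexpensive regularizer for the large S2S models discussed above.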