Stability. To stabilize the training of Transformer-based neural language models, there have been various discussions on the architecture (Xiong et al., 2020; Liu et al., 2020; Zeng et al., 2023; Zhai et al., 2023), the initialization method (Nguyen & Salazar, 2019; Zhang et al., 2019b; Huang et al., 2020; Wang et al., 2022), the training strategy (Zhang et al., 2022; Li et al., 2022), and the loss function (Chowdhery et al., 2022; Wortsman et al., 2023). Xiong et al. (2020) theoretically analyzed the gradient scales of each part of the Transformer and showed that the Pre-LN Transformer, which applies layer normalization to the input of each sublayer, is more stable than the Post-LN Transformer, that is, the original Transformer architecture (Vaswani et al., 2017).
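To make the Pre-LN/Post-LN distinction concrete, the following is a minimal PyTorch sketch (not code from the cited works); the class names `PostLNBlock` and `PreLNBlock` and the dimension arguments are illustrative assumptions, and dropout and masking are omitted for brevity.

```python
import torch.nn as nn


class PostLNBlock(nn.Module):
    """Post-LN (Vaswani et al., 2017): LayerNorm is applied
    after the residual addition, so it sits on the residual path."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Normalize the sum of the residual and the sublayer output.
        x = self.ln1(x + self.attn(x, x, x, need_weights=False)[0])
        x = self.ln2(x + self.ff(x))
        return x


class PreLNBlock(nn.Module):
    """Pre-LN: LayerNorm is applied to the sublayer input;
    the residual path itself stays an identity mapping."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Normalize only the sublayer input; add its output to the raw residual.
        h = self.ln1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.ff(self.ln2(x))
        return x
```

The structural difference is the point of Xiong et al. (2020)'s analysis: in the Pre-LN block the residual path is an identity mapping, so gradients pass through depth without repeatedly traversing LayerNorm, whereas in the Post-LN block every layer's gradient is rescaled by the normalization on the residual path, which makes training more sensitive to warmup and learning-rate choices.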