Stability. To stabilize the training of Transformer-based neural language models, there have been various discussions on the architecture (Xiong et al., 2020; Liu et al., 2020; Zeng et al., 2023; Zhai et al., 2023), the initialization method (Nguyen & Salazar, 2019; Zhang et al., 2019b; Huang et al., 2020; Wang et al., 2022), the training strategy (Zhang et al., 2022; Li et al., 2022), and the loss function (Chowdhery et al., 2022; Wortsman et al., 2023). Xiong et al. (2020) theoretically analyzed the gradient scales of each part of the Transformer and showed that the Pre-LN Transformer, which applies layer normalization to the input of each sublayer, is more stable than the Post-LN Transformer, that is, the original Transformer architecture (Vaswani et al., 2017).
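To make the Pre-LN/Post-LN distinction concrete, the following is a minimal PyTorch sketch (not code from the cited works); the class names `PostLNBlock` and `PreLNBlock` and the dimension arguments are illustrative assumptions, and dropout and masking are omitted for brevity.

```python
import torch.nn as nn


class PostLNBlock(nn.Module):
    """Post-LN (Vaswani et al., 2017): LayerNorm is applied
    after the residual addition, so it sits on the residual path."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Normalize the sum of the residual and the sublayer output.
        x = self.ln1(x + self.attn(x, x, x, need_weights=False)[0])
        x = self.ln2(x + self.ff(x))
        return x


class PreLNBlock(nn.Module):
    """Pre-LN: LayerNorm is applied to the sublayer input;
    the residual path itself stays an identity mapping."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Normalize only the sublayer input; add its output to the raw residual.
        h = self.ln1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.ff(self.ln2(x))
        return x
```

The structural difference is the point of Xiong et al. (2020)'s analysis: in the Pre-LN block the residual path is an identity mapping, so gradients pass through depth without repeatedly traversing LayerNorm, whereas in the Post-LN block every layer's gradient is rescaled by the normalization on the residual path, which makes training more sensitive to warmup and learning-rate choices.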