“…In natural language processing (NLP), unsupervised pre-training of language models (Devlin et al., 2018; Radford et al., 2018) improved many tasks such as text classification, phrase structure parsing and machine translation (Lample & Conneau, 2019). In speech processing, pre-training has focused on emotion recognition (Lian et al., 2018), speaker identification, phoneme discrimination (Synnaeve & Dupoux, 2016a; van den Oord et al., 2018), as well as transferring ASR representations from one language to another (Kunze et al., 2017). There has been work on unsupervised learning for speech, but the resulting representations have not been applied to improve supervised speech recognition (Synnaeve & Dupoux, 2016b; Kamper et al., 2017; Chung et al., 2018; Chen et al., 2018; Chorowski et al., 2019).…”