“…ASR performance is reported to improve when the CTC loss is combined with an attention mechanism [28] or when the Transformer structure is used [14,15]. In particular, the Transformer structure, which was originally designed for natural language processing (NLP) problems [29,30], has been successfully applied in several other domains, such as computer vision (CV) [31,32] and speech-related tasks including text-to-speech (TTS) [33,34,18,19], voice conversion (VC) [35], and ASR [12,13].…”