“…The text processing front-end, which handles tasks such as text normalization, grapheme-to-phoneme (G2P) conversion, and phrase break prediction, has become a core component of modern text-to-speech (TTS) systems. Many studies have shown that these text processing modules improve the naturalness of synthetic speech through a variety of approaches, including traditional statistical methods [1][2][3] and deep learning-based methods [4][5][6][7][8][9]. Recently, following the great success of BERT [10] on various natural language processing (NLP) tasks, most proposed works have adopted mainstream pre-trained language models (PLMs) [10][11][12][13].…”