ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9053390

A Unified Sequence-to-Sequence Front-End Model for Mandarin Text-to-Speech Synthesis

Abstract: In a Mandarin text-to-speech (TTS) system, the front-end text processing module significantly influences the intelligibility and naturalness of the synthesized speech. Building a typical pipeline-based front-end consisting of multiple individual components requires extensive effort. In this paper, we propose a unified sequence-to-sequence front-end model for Mandarin TTS that converts raw text to linguistic features directly. Compared to the pipeline-based front-end, our unified front-end can achieve comparable…

Cited by 22 publications (10 citation statements)
References 17 publications
“…To reduce the cumulative training error of each part and to simplify the model, the text front-end components with various functions can be combined. Pan et al. [155] proposed a Mandarin text front-end model that unifies a series of text processing components and can directly convert raw text into linguistic features. First, the raw text is normalized by the method proposed by Zhang et al. [258]. Then, a Word2Vec model converts sentences into character embeddings, and an auxiliary model composed of dilated convolutions or a Transformer encoder predicts Chinese word segmentation (CWS) and part-of-speech (POS) tags, respectively.…”
Section: Unified Text Front-end
confidence: 99%
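Character-level CWS and POS prediction is usually framed as sequence labeling over characters. A minimal sketch of that framing, assuming a BMES-POS tag scheme (a common convention; the cited paper's exact label set is not specified here):

```python
def words_to_char_tags(words_with_pos):
    # Expand each (word, POS) pair into per-character labels:
    # S- for single-character words; B-/M-/E- mark word-initial,
    # word-internal, and word-final characters. A character-level
    # tagger trained on these labels jointly recovers CWS and POS.
    tags = []
    for word, pos in words_with_pos:
        if len(word) == 1:
            tags.append(f"S-{pos}")
        else:
            tags.append(f"B-{pos}")
            tags.extend(f"M-{pos}" for _ in word[1:-1])
            tags.append(f"E-{pos}")
    return tags
```

For example, a segmented sentence `[("我", "PN"), ("喜欢", "VV")]` yields one label per character: `["S-PN", "B-VV", "E-VV"]`.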
“…A simple method to reduce exposure bias is scheduled sampling [13], in which the acoustic feature frames of the current time step are predicted from either the natural (ground-truth) frames or the frames predicted at the previous time step, chosen with a certain probability [141,155]. However, because the natural and predicted speech frames are inconsistent during scheduled sampling, the temporal correlation of the acoustic feature sequence is destroyed, degrading the quality of the synthesized speech.…”
Section: Stable Autoregressive Generation Process
confidence: 99%
“…Despite the difference in the source text, Chinese Braille speech synthesis and Chinese speech synthesis face similar challenges, such as disambiguation in text-to-pronunciation conversion and the prediction of multi-level prosodic structure. Related research includes Chinese text normalization based on multi-head attention [5] and unified Chinese text front-end processing using a structure similar to Tacotron 2 [6]. Chinese Braille speech synthesis also faces additional challenges.…”
Section: Introduction
confidence: 99%
“…The front-end text processing system plays an important role in determining the intelligibility and naturalness of a Mandarin text-to-speech (TTS) system [1,2]. The typical Mandarin front-end is usually designed as a pipeline structure consisting of a series of individual components, such as polyphone disambiguation (PD), text normalization (TN), prosodic boundary prediction (PBP), Chinese word segmentation (CWS), and part-of-speech (POS) tagging.…”
Section: Introduction
confidence: 99%
“…Applying multi-task learning (MTL) and fine-tuning pre-trained models has shown impressive results in front-end text processing tasks [3,4]. Nevertheless, a pipeline-based front-end with a complex structure also brings several problems, including error propagation, inference latency, and misalignment in optimization [1]. Moreover, each component is modeled separately, which increases system complexity and reduces maintainability.…”
Section: Introduction
confidence: 99%
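The multi-task setup mentioned above typically shares one encoder across front-end tasks and optimizes a weighted sum of per-task losses. A minimal sketch, assuming the task weights are tuned hyperparameters (the cited works' exact weighting is not given here):

```python
def multitask_loss(task_losses, task_weights):
    # Combine per-task losses (e.g. CWS, POS, prosodic boundary
    # prediction) computed over a shared encoder into one training
    # objective: a weighted sum, the simplest MTL combination rule.
    assert len(task_losses) == len(task_weights)
    return sum(w * l for l, w in zip(task_losses, task_weights))
```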