ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9053390

A Unified Sequence-to-Sequence Front-End Model for Mandarin Text-to-Speech Synthesis

Abstract: In a Mandarin text-to-speech (TTS) system, the front-end text processing module significantly influences the intelligibility and naturalness of the synthesized speech. Building a typical pipeline-based front-end consisting of multiple individual components requires extensive effort. In this paper, we propose a unified sequence-to-sequence front-end model for Mandarin TTS that converts raw text to linguistic features directly. Compared to the pipeline-based front-end, our unified front-end can achieve comparable…

Cited by 22 publications (10 citation statements)
References 17 publications
“…To reduce the cumulative training error of each part and to simplify the model, the text front-end components with various functions can be combined. Pan et al. [155] proposed a Mandarin text front-end model that unifies a series of text processing components and can directly convert raw text into linguistic features. First, the raw text is normalized by the method proposed by Zhang et al. [258]. Then, a Word2Vec model converts sentences into character embeddings, and an auxiliary model composed of dilated convolutions or a Transformer encoder predicts Chinese word segmentation (CWS) and part-of-speech (POS) tags, respectively.…”
Section: Unified Text Front-end
confidence: 99%
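Character-level CWS and POS prediction is usually framed as sequence labeling over characters. A minimal sketch of that framing, assuming a BMES-POS tag scheme (a common convention; the cited paper's exact label set is not specified here):

```python
def words_to_char_tags(words_with_pos):
    # Expand each (word, POS) pair into per-character labels:
    # S- for single-character words; B-/M-/E- mark word-initial,
    # word-internal, and word-final characters. A character-level
    # tagger trained on these labels jointly recovers CWS and POS.
    tags = []
    for word, pos in words_with_pos:
        if len(word) == 1:
            tags.append(f"S-{pos}")
        else:
            tags.append(f"B-{pos}")
            tags.extend(f"M-{pos}" for _ in word[1:-1])
            tags.append(f"E-{pos}")
    return tags
```

For example, a segmented sentence `[("我", "PN"), ("喜欢", "VV")]` yields one label per character: `["S-PN", "B-VV", "E-VV"]`.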
“…A simple method to reduce exposure bias is scheduled sampling [13], in which the acoustic feature frames of the current time step are predicted from either the natural (ground-truth) frames or the frames predicted at the previous time step, chosen with a certain probability [141,155]. However, because the natural and predicted speech frames are inconsistent during scheduled sampling, the temporal correlation of the acoustic feature sequence is destroyed, degrading the quality of the synthesized speech.…”
Section: Stable Autoregressive Generation Process
confidence: 99%
“…Despite the difference in the source text, Chinese Braille speech synthesis and Chinese speech synthesis face similar challenges, such as disambiguation in text-to-pronunciation conversion and the prediction of multi-level prosodic structure. Related research includes Chinese text normalization based on multi-head attention [5] and unified Chinese text front-end processing using a structure similar to Tacotron 2 [6]. Chinese Braille speech synthesis also faces additional challenges.…”
Section: Introduction
confidence: 99%
“…The front-end text processing system plays an important role in determining the intelligibility and naturalness of a Mandarin text-to-speech (TTS) system [1,2]. The typical Mandarin front-end is usually designed as a pipeline structure consisting of a series of individual components, such as polyphone disambiguation (PD), text normalization (TN), prosodic boundary prediction (PBP), Chinese word segmentation (CWS), and part-of-speech (POS) tagging.…”
Section: Introduction
confidence: 99%
“…Applying multi-task learning (MTL) and fine-tuning pre-trained models has shown impressive results in front-end text processing tasks [3,4]. Nevertheless, a pipeline-based front-end with a complex structure also brings several problems, including error propagation, inference latency, and misalignment in optimization [1]. Moreover, each component is modeled separately, which increases system complexity and reduces maintainability.…”
Section: Introduction
confidence: 99%
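The multi-task setup mentioned above typically shares one encoder across front-end tasks and optimizes a weighted sum of per-task losses. A minimal sketch, assuming the task weights are tuned hyperparameters (the cited works' exact weighting is not given here):

```python
def multitask_loss(task_losses, task_weights):
    # Combine per-task losses (e.g. CWS, POS, prosodic boundary
    # prediction) computed over a shared encoder into one training
    # objective: a weighted sum, the simplest MTL combination rule.
    assert len(task_losses) == len(task_weights)
    return sum(w * l for l, w in zip(task_losses, task_weights))
```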