Interspeech 2021
DOI: 10.21437/interspeech.2021-1757
PnG BERT: Augmented BERT on Phonemes and Graphemes for Neural TTS

Abstract: This paper introduces PnG BERT, a new encoder model for neural TTS. This model is augmented from the original BERT model by taking both phoneme and grapheme representations of text as input, as well as the word-level alignment between them. It can be pre-trained on a large text corpus in a self-supervised manner and fine-tuned in a TTS task. Experimental results show that a neural TTS model using a pre-trained PnG BERT as its encoder yields more natural prosody and more accurate pronunciation than a baseline …
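As a rough illustration of the input format the abstract describes, the sketch below shows one plausible way a phoneme segment and a grapheme segment could be packed into a single BERT-style sequence, with shared word indices carrying the word-level alignment. The function name, special tokens, and ID conventions are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch (not the authors' code) of assembling a PnG BERT style
# input: phoneme tokens and grapheme tokens are concatenated into one sequence,
# segment IDs mark which half a token belongs to, and shared word IDs encode
# the word-level alignment between the two halves.
from typing import List, Tuple

def build_png_bert_input(
    phonemes_per_word: List[List[str]],
    graphemes_per_word: List[List[str]],
) -> Tuple[List[str], List[int], List[int]]:
    """Return (tokens, segment_ids, word_ids) for one sentence."""
    tokens: List[str] = ["[CLS]"]
    segment_ids: List[int] = [0]
    word_ids: List[int] = [0]

    # Segment 0: phoneme tokens, tagged with the index of their source word.
    for w_idx, phones in enumerate(phonemes_per_word, start=1):
        for p in phones:
            tokens.append(p)
            segment_ids.append(0)
            word_ids.append(w_idx)
    tokens.append("[SEP]"); segment_ids.append(0); word_ids.append(0)

    # Segment 1: grapheme tokens, tagged with the same word indices,
    # which is what exposes the word-level alignment to the model.
    for w_idx, chars in enumerate(graphemes_per_word, start=1):
        for g in chars:
            tokens.append(g)
            segment_ids.append(1)
            word_ids.append(w_idx)
    tokens.append("[SEP]"); segment_ids.append(1); word_ids.append(0)

    return tokens, segment_ids, word_ids


# Example: the two-word sentence "hello world".
tokens, segs, word_ids = build_png_bert_input(
    phonemes_per_word=[["HH", "AH", "L", "OW"], ["W", "ER", "L", "D"]],
    graphemes_per_word=[list("hello"), list("world")],
)
print(list(zip(tokens, segs, word_ids)))
```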

Cited by 33 publications (28 citation statements). References 19 publications (39 reference statements).
“…The subjective evaluation results show that the whole word masking strategy increases TTS performance. The work in [23] reports a similar finding. We consider that, when the representation capacity of the model input is unchanged, increasing the difficulty of the MLM prediction task might, to some extent, improve the performance of the downstream TTS task.…”
Section: Analysis on Masking Strategy (supporting, confidence: 68%)
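To make the masking discussion above concrete, here is a minimal, hypothetical sketch of whole-word masking over a token sequence with per-token word IDs: whole words are selected and all of their tokens are masked together, rather than masking tokens independently. The data layout and the 15% selection probability are assumptions, not taken from either cited paper.

```python
# Hedged sketch of whole-word masking for an MLM objective: selecting entire
# words and masking every token aligned to them makes the prediction task
# harder than independent per-token masking.
import random
from typing import List

MASK_TOKEN = "[MASK]"

def whole_word_mask(tokens: List[str], word_ids: List[int],
                    mask_prob: float = 0.15) -> List[str]:
    """Mask every token of each selected word (word_id 0 marks special tokens)."""
    candidate_words = sorted({w for w in word_ids if w != 0})
    selected = {w for w in candidate_words if random.random() < mask_prob}
    return [MASK_TOKEN if w in selected else t
            for t, w in zip(tokens, word_ids)]

# Example: if word 2 is selected, all of its phoneme and grapheme tokens
# are replaced by [MASK] at once.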
“…We evaluate the voice quality and inference latency of Mixed-Phoneme BERT compared with the recent TTS pre-trained model, PnG BERT [23], which has a similar number of model parameters and training steps to Mixed-Phoneme BERT. We show the CMOS results and inference speedup for mel-spectrogram generation in Table 2.…”
Section: Compared with the PnG BERT (mentioning, confidence: 99%)
“…Methods have shifted from parametric models towards increasingly end-to-end neural networks [6,7]. This shift enabled TTS models to generate speech that sounds as natural as professional human speech [8]. Most approaches consist of three main components: an encoder that converts the input text into a sequence of hidden representations, a decoder that produces acoustic representations like mel-spectrograms from these, and finally a vocoder that constructs waveforms from the acoustic representations.…”
Section: Related Work (mentioning, confidence: 99%)
“…Most approaches consist of three main components: an encoder that converts the input text into a sequence of hidden representations, a decoder that produces acoustic representations such as mel-spectrograms from these, and finally a vocoder that constructs waveforms from the acoustic representations. Some methods, including Tacotron and Tacotron 2, use an attention-based autoregressive approach [7,9,10]; follow-up work such as FastSpeech [11,12], Non-Attentive Tacotron (NAT) [8,13] and Parallel Tacotron [14,15] often replaces recurrent neural networks with transformers.…”
Section: Related Work (mentioning, confidence: 99%)
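The three-stage pipeline described in these excerpts (text encoder, acoustic decoder, vocoder) can be summarised with a toy sketch. The functions below are deliberately trivial placeholders that only show how the stages compose; real systems substitute a transformer or RNN encoder (e.g. a pre-trained PnG BERT), an autoregressive or parallel acoustic decoder, and a neural vocoder.

```python
# Toy illustration (assumptions, not any specific system) of the three-stage
# TTS pipeline: encoder -> acoustic decoder -> vocoder.
from typing import List, Sequence

def encode_text(token_ids: Sequence[int]) -> List[List[float]]:
    """Encoder: map input tokens to a sequence of hidden vectors."""
    return [[float(t), float(t) * 0.5] for t in token_ids]

def decode_acoustics(hidden: List[List[float]]) -> List[List[float]]:
    """Decoder: map hidden vectors to mel-spectrogram-like frames,
    autoregressively (Tacotron-style) or in parallel (FastSpeech-style)."""
    return [[sum(h) / len(h)] * 4 for h in hidden]  # 4 dummy mel bins

def vocode(mel_frames: List[List[float]]) -> List[float]:
    """Vocoder: map acoustic frames to waveform samples."""
    return [v for frame in mel_frames for v in frame]  # dummy upsampling

# End-to-end composition of the three stages.
waveform = vocode(decode_acoustics(encode_text([3, 1, 4, 1, 5])))
print(len(waveform))
```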