Interspeech 2021
DOI: 10.21437/interspeech.2021-1757
PnG BERT: Augmented BERT on Phonemes and Graphemes for Neural TTS

Abstract: This paper introduces PnG BERT, a new encoder model for neural TTS. This model is augmented from the original BERT model by taking both phoneme and grapheme representations of text as input, as well as the word-level alignment between them. It can be pre-trained on a large text corpus in a self-supervised manner and fine-tuned in a TTS task. Experimental results show that a neural TTS model using a pre-trained PnG BERT as its encoder yields more natural prosody and more accurate pronunciation than a baseline …
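As a rough illustration of the input format the abstract describes, the sketch below shows one plausible way a phoneme segment and a grapheme segment could be packed into a single BERT-style sequence, with shared word indices carrying the word-level alignment. The function name, special tokens, and ID conventions are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch (not the authors' code) of assembling a PnG BERT style
# input: phoneme tokens and grapheme tokens are concatenated into one sequence,
# segment IDs mark which half a token belongs to, and shared word IDs encode
# the word-level alignment between the two halves.
from typing import List, Tuple

def build_png_bert_input(
    phonemes_per_word: List[List[str]],
    graphemes_per_word: List[List[str]],
) -> Tuple[List[str], List[int], List[int]]:
    """Return (tokens, segment_ids, word_ids) for one sentence."""
    tokens: List[str] = ["[CLS]"]
    segment_ids: List[int] = [0]
    word_ids: List[int] = [0]

    # Segment 0: phoneme tokens, tagged with the index of their source word.
    for w_idx, phones in enumerate(phonemes_per_word, start=1):
        for p in phones:
            tokens.append(p)
            segment_ids.append(0)
            word_ids.append(w_idx)
    tokens.append("[SEP]"); segment_ids.append(0); word_ids.append(0)

    # Segment 1: grapheme tokens, tagged with the same word indices,
    # which is what exposes the word-level alignment to the model.
    for w_idx, chars in enumerate(graphemes_per_word, start=1):
        for g in chars:
            tokens.append(g)
            segment_ids.append(1)
            word_ids.append(w_idx)
    tokens.append("[SEP]"); segment_ids.append(1); word_ids.append(0)

    return tokens, segment_ids, word_ids


# Example: the two-word sentence "hello world".
tokens, segs, word_ids = build_png_bert_input(
    phonemes_per_word=[["HH", "AH", "L", "OW"], ["W", "ER", "L", "D"]],
    graphemes_per_word=[list("hello"), list("world")],
)
print(list(zip(tokens, segs, word_ids)))
```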

Cited by 33 publications (28 citation statements). References 19 publications (39 reference statements).
“…The subjective evaluation results show that the whole word masking strategy increases TTS performance. The work in [23] reports a similar finding. We consider that, when the representation capacity of the model input is unchanged, increasing the difficulty of the MLM prediction task might, to some extent, improve the performance of the downstream TTS task.…”
Section: Analysis on Masking Strategy (supporting, confidence: 68%)
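To make the masking discussion above concrete, here is a minimal, hypothetical sketch of whole-word masking over a token sequence with per-token word IDs: whole words are selected and all of their tokens are masked together, rather than masking tokens independently. The data layout and the 15% selection probability are assumptions, not taken from either cited paper.

```python
# Hedged sketch of whole-word masking for an MLM objective: selecting entire
# words and masking every token aligned to them makes the prediction task
# harder than independent per-token masking.
import random
from typing import List

MASK_TOKEN = "[MASK]"

def whole_word_mask(tokens: List[str], word_ids: List[int],
                    mask_prob: float = 0.15) -> List[str]:
    """Mask every token of each selected word (word_id 0 marks special tokens)."""
    candidate_words = sorted({w for w in word_ids if w != 0})
    selected = {w for w in candidate_words if random.random() < mask_prob}
    return [MASK_TOKEN if w in selected else t
            for t, w in zip(tokens, word_ids)]

# Example: if word 2 is selected, all of its phoneme and grapheme tokens
# are replaced by [MASK] at once.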
“…We evaluate the voice quality and inference latency of Mixed-Phoneme BERT compared with the recent TTS pre-trained model, PnG BERT [23], which has a similar number of model parameters and training steps to Mixed-Phoneme BERT. We show the CMOS results and inference speedup for mel-spectrogram generation in Table 2.…”
Section: Compared with the PnG BERT (mentioning, confidence: 99%)
“…Methods have shifted from parametric models towards increasingly end-to-end neural networks [6,7]. This shift enabled TTS models to generate speech that sounds as natural as professional human speech [8]. Most approaches consist of three main components: an encoder that converts the input text into a sequence of hidden representations, a decoder that produces acoustic representations like mel-spectrograms from these, and finally a vocoder that constructs waveforms from the acoustic representations.…”
Section: Related Work (mentioning, confidence: 99%)
“…Most approaches consist of three main components: an encoder that converts the input text into a sequence of hidden representations, a decoder that produces acoustic representations such as mel-spectrograms from these, and finally a vocoder that constructs waveforms from the acoustic representations. Some methods, including Tacotron and Tacotron 2, use an attention-based autoregressive approach [7,9,10]; follow-up work such as FastSpeech [11,12], Non-Attentive Tacotron (NAT) [8,13] and Parallel Tacotron [14,15] often replaces recurrent neural networks with transformers.…”
Section: Related Work (mentioning, confidence: 99%)
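The three-stage pipeline described in these excerpts (text encoder, acoustic decoder, vocoder) can be summarised with a toy sketch. The functions below are deliberately trivial placeholders that only show how the stages compose; real systems substitute a transformer or RNN encoder (e.g. a pre-trained PnG BERT), an autoregressive or parallel acoustic decoder, and a neural vocoder.

```python
# Toy illustration (assumptions, not any specific system) of the three-stage
# TTS pipeline: encoder -> acoustic decoder -> vocoder.
from typing import List, Sequence

def encode_text(token_ids: Sequence[int]) -> List[List[float]]:
    """Encoder: map input tokens to a sequence of hidden vectors."""
    return [[float(t), float(t) * 0.5] for t in token_ids]

def decode_acoustics(hidden: List[List[float]]) -> List[List[float]]:
    """Decoder: map hidden vectors to mel-spectrogram-like frames,
    autoregressively (Tacotron-style) or in parallel (FastSpeech-style)."""
    return [[sum(h) / len(h)] * 4 for h in hidden]  # 4 dummy mel bins

def vocode(mel_frames: List[List[float]]) -> List[float]:
    """Vocoder: map acoustic frames to waveform samples."""
    return [v for frame in mel_frames for v in frame]  # dummy upsampling

# End-to-end composition of the three stages.
waveform = vocode(decode_acoustics(encode_text([3, 1, 4, 1, 5])))
print(len(waveform))
```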