Interspeech 2020
DOI: 10.21437/interspeech.2020-1430

Improving the Prosody of RNN-Based English Text-To-Speech Synthesis by Incorporating a BERT Model

Cited by 30 publications (31 citation statements) | References 17 publications

“…A similar method further verifies the ability of BERT to improve prosody on a Chinese multi-speaker TTS task [15]. Along different lines, CHiVE-BERT [16] incorporates a BERT model in an RNN-based speech synthesis model. These approaches have improved the prosody of synthesized speech by exploiting phrase- and word-level semantic information from BERT.…”
Section: Introduction
confidence: 68%
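
The CHiVE-BERT approach quoted above, feeding BERT features into an RNN-based synthesis model, can be pictured with a minimal sketch. Everything below is hypothetical: the module names, dimensions, and phoneme-to-word-piece alignment scheme are illustrative choices, not details of [16]. The sketch shows one common way to condition a phoneme-level RNN encoder on frozen BERT word-piece states.

```python
import torch
import torch.nn as nn
from transformers import BertModel


class BertConditionedEncoder(nn.Module):
    """Phoneme encoder conditioned on frozen BERT word-piece states (sketch)."""

    def __init__(self, phoneme_vocab=100, phone_dim=256, bert_dim=768, hidden=256):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        for p in self.bert.parameters():   # assumption: BERT acts as a frozen
            p.requires_grad = False        # feature extractor
        self.phone_emb = nn.Embedding(phoneme_vocab, phone_dim)
        # The RNN sees each phoneme embedding concatenated with the BERT
        # state of the word piece that covers it.
        self.rnn = nn.GRU(phone_dim + bert_dim, hidden,
                          batch_first=True, bidirectional=True)

    def forward(self, phone_ids, wordpiece_ids, phone_to_wp):
        # phone_ids:     (B, T_ph) phoneme indices
        # wordpiece_ids: (B, T_wp) BERT word-piece indices for the same text
        # phone_to_wp:   (B, T_ph) index of the word piece covering each phoneme
        with torch.no_grad():
            wp = self.bert(wordpiece_ids).last_hidden_state   # (B, T_wp, 768)
        # Upsample word-piece states onto the phoneme time axis.
        idx = phone_to_wp.unsqueeze(-1).expand(-1, -1, wp.size(-1))
        bert_per_phone = wp.gather(1, idx)                    # (B, T_ph, 768)
        x = torch.cat([self.phone_emb(phone_ids), bert_per_phone], dim=-1)
        out, _ = self.rnn(x)                                  # (B, T_ph, 2*hidden)
        return out  # passed on to the acoustic decoder
```
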
“…Recently, the large pre-trained language model BERT [9] has exhibited impressive performance on many natural language processing (NLP) tasks, so it has also been introduced into TTS [10][11][12]. Refs.…”
Section: Related Work
confidence: 99%
“…Ref. [12] tries to fine-tune the BERT parameters with a prosody prediction task but still freezes the word-piece embeddings. All these works report some gains in naturalness.…”
Section: Related Work
confidence: 99%
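
The freezing pattern attributed to Ref. [12] (fine-tune BERT on an auxiliary prosody prediction task while the word-piece embedding table stays frozen) can be sketched with Hugging Face transformers. The three-class break-prediction head below is a placeholder for illustration, not the actual task head from [12].

```python
import torch
import torch.nn as nn
from transformers import BertModel

bert = BertModel.from_pretrained("bert-base-uncased")

# Freeze only the word-piece (token) embedding table; positional and
# token-type embeddings and all transformer layers remain trainable.
bert.embeddings.word_embeddings.weight.requires_grad = False

# Placeholder prosody head: per-token break index (none / minor / major
# phrase boundary), one common choice of prosody prediction target.
prosody_head = nn.Linear(bert.config.hidden_size, 3)

trainable = [p for p in bert.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable + list(prosody_head.parameters()),
                              lr=2e-5)
```
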
“…Recently, work on English has also used linguistic features to improve prosody: syllabic stress [21], semantic and syntactic features [22,23], and pre-trained language model embeddings [24,25]. Clockwork RNNs were also used to hierarchically encode linguistic features at varying levels in [26], a hierarchical encoder having previously helped in DNN-based synthesis [27,28].…”
Section: Linguistic Features In Tacotron
confidence: 99%
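
For readers unfamiliar with clockwork RNNs, the following is a minimal sketch of the general mechanism (Koutník et al.): hidden units are partitioned into modules with increasing clock periods, and a module updates only at timesteps divisible by its period, so slow modules track longer-range structure such as phrases. The cell below simplifies the original block-structured recurrent connectivity to full connectivity and is not the architecture of [26].

```python
import torch
import torch.nn as nn


class ClockworkRNNCell(nn.Module):
    """Simplified clockwork RNN cell: modules with distinct clock periods."""

    def __init__(self, input_size, module_size=32, periods=(1, 2, 4, 8)):
        super().__init__()
        self.periods = periods
        self.module_size = module_size
        self.hidden_size = module_size * len(periods)
        self.w_in = nn.Linear(input_size, self.hidden_size)
        self.w_hid = nn.Linear(self.hidden_size, self.hidden_size, bias=False)
        # Note: the original formulation restricts recurrent connectivity so
        # fast modules read from slow ones; full connectivity is used here
        # purely to keep the sketch short.

    def forward(self, x_t, h, t):
        # x_t: (B, input_size), h: (B, hidden_size), t: integer timestep
        h_cand = torch.tanh(self.w_in(x_t) + self.w_hid(h))
        h_next = h.clone()
        for i, period in enumerate(self.periods):
            if t % period == 0:  # only modules whose clock fires update
                s = i * self.module_size
                h_next[:, s:s + self.module_size] = h_cand[:, s:s + self.module_size]
        return h_next
```

A module with period 8 thus refreshes only once every eight input steps, which is what lets slow modules encode slower-moving, phrase-level features while fast modules follow local detail.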