CAMP: A Two-Stage Approach to Modelling Prosody in Context

Hodari, Zack; Moinet, Alexis; Karlapati, Sri; Lorenzo-Trueba, Jaime; Merritt, Thomas; Joly, Arnaud; Abbas, Ammar; Karanasou, Penny; Drugman, Thomas

doi:10.1109/icassp39728.2021.9414413

Cited by 13 publications

(17 citation statements)

References 28 publications

“…Based on the proposed cross-speaker reading style transfer model, an automatic audiobook generation system could be constructed by incorporating a text analysis model which predicts the LPE and genre from given book content. In our practice, the prediction model is implemented with RNN and linear layers, which takes BERT [18] token embedding and Tacotron-2 phoneme embedding as its inputs, similar to existing methods [19,10]. According to the predicted genre label and the identity of the desired speaker, the GSE vectors on each branch could be obtained by choosing the averaged GSE vectors over the training data of the target genre/speaker.…”

Section: Automatic Audiobook Generation (mentioning)

confidence: 99%

Towards Cross-speaker Reading Style Transfer on Audiobook Dataset

Li, Song, Wei, et al. 2022

Interspeech 2022

Cross-speaker style transfer aims to extract the speech style of the given reference speech, which can be reproduced in the timbre of arbitrary target speakers. Existing methods on this topic have explored utilizing utterance-level style labels to perform style transfer via either global or local scale style representations. However, audiobook datasets are typically characterized by both the local prosody and global genre, and are rarely accompanied by utterance-level style labels. Thus, properly transferring the reading style across different speakers remains a challenging task. This paper aims to introduce a chunk-wise multi-scale cross-speaker style model to capture both the global genre and the local prosody in audiobook speeches. Moreover, by disentangling speaker timbre and style with the proposed switchable adversarial classifiers, the extracted reading style is made adaptable to the timbre of different speakers. Experiment results confirm that the model manages to transfer a given reading style to new target speakers. With the support of local prosody and global genre type predictor, the potentiality of the proposed method in multi-speaker audiobook generation is further revealed.

“…[18] leverages ToBI labels into neural TTS to improve the prosody. Recent works [19,20] attempt to sample from the learned prosodic distribution using contextual information. However, sentence-level prosody related to the speaker's intention draws much less attention.…”

Section: Introduction (mentioning)

confidence: 99%

“…However, various methods [7,8,9,10] based on sampling or reference audios do not consider textual semantic information, nor does the pre-trained prosody encoder [17]. Recent attempts on injecting various linguistic features [13,14,15,16] and context-based prosody sampling [19,20] can compensate for the information loss, but they do not provide discriminative information about sentence types. Therefore, the performance on rising intonation featured by declarative questions is not well studied yet.…”

Section: Introduction (mentioning)

confidence: 99%

A Study of Modeling Rising Intonation in Cantonese Neural Speech Synthesis

Bai, Ko 2022

Interspeech 2022

In human speech, the attitude of a speaker cannot be fully expressed only by the textual content. It has to come along with the intonation. Declarative questions are commonly used in daily Cantonese conversations, and they are usually uttered with rising intonation. Vanilla neural text-to-speech (TTS) systems are not capable of synthesizing rising intonation for these sentences due to the loss of semantic information. Though it has become more common to complement the systems with extra language models, their performance in modeling rising intonation is not well studied. In this paper, we propose to complement the Cantonese TTS model with a BERT-based statement/question classifier. We design different training strategies and compare their performance. We conduct our experiments on a Cantonese corpus named CanTTS. Empirical results show that the separate training approach obtains the best generalization performance and feasibility.

“…Context-related works There has been a wide range of research focused on learning or extracting contextual information to improve the performance of TTS. Multiple studies used textual context information extracted from text to improve the sentence prosody [37,38] or cross-sentence prosody [39] for sentence-based speech synthesis, or capture conversation information for conversational speech synthesis [40]. Specifically, the textual context information can be semantics-related features extracted by pre-trained models [39,40], i.e., BERT [41] or syntax-related features represented by parse trees [38,42] or statistics [37,40].…”

Section: Introduction (mentioning)

confidence: 99%

ParaTTS: Learning Linguistic and Prosodic Cross-Sentence Information in Paragraph-Based TTS

Xue, Soong, Zhang, et al. 2022

IEEE/ACM Trans. Audio Speech Lang. Process.

Recent advancements in neural end-to-end TTS models have shown high-quality, natural synthesized speech in a conventional sentence-based TTS. However, it is still challenging to reproduce similar high quality when a whole paragraph is considered in TTS, where a large amount of contextual information needs to be considered in building a paragraph-based TTS model. To alleviate the difficulty in training, we propose to model linguistic and prosodic information by considering cross-sentence, embedded structure in training. Three sub-modules, including linguistics-aware, prosody-aware and sentence-position networks, are trained together with a modified Tacotron2. Specifically, to learn the information embedded in a paragraph and the relations among the corresponding component sentences, we utilize linguistics-aware and prosody-aware networks. The information in a paragraph is captured by encoders and the inter-sentence information in a paragraph is learned with multi-head attention mechanisms. The relative sentence position in a paragraph is explicitly exploited by a sentence-position network. Trained on a storytelling audio-book corpus (4.08 hours), recorded by a female Mandarin Chinese speaker, the proposed TTS model demonstrates that it can produce rather natural and good-quality speech paragraph-wise. The cross-sentence contextual information, such as break and prosodic variations between consecutive sentences, can be better predicted and rendered than the sentence-based model. Tested on paragraph texts, of which the lengths are similar to, longer than, or much longer than the typical paragraph length of the training data, the TTS speech produced by the new model is consistently preferred over the sentence-based model in subjective tests and confirmed in objective measures.
