ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2021
DOI: 10.1109/icassp39728.2021.9414413

CAMP: A Two-Stage Approach to Modelling Prosody in Context

Abstract: Prosody is an integral part of communication, but remains an open problem in state-of-the-art speech synthesis. There are two major issues faced when modelling prosody: (1) prosody varies at a slower rate compared with other content in the acoustic signal (e.g. segmental information and background noise); (2) determining appropriate prosody without sufficient context is an ill-posed problem. In this paper, we propose solutions to both these issues. To mitigate the challenge of modelling a slow-varying signal, …

Citations: cited by 13 publications (17 citation statements)
References: 28 publications
“…Based on the proposed cross-speaker reading style transfer model, an automatic audiobook generation system could be constructed by incorporating a text analysis model which predicts the LPE and genre from given book content. In our practice, the prediction model is implemented with RNN and linear layers, which takes BERT [18] token embedding and Tacotron-2 phoneme embedding as its inputs, similar to existing methods [19,10]. According to the predicted genre label and the identity of the desired speaker, the GSE vectors on each branch could be obtained by choosing the averaged GSE vectors over the training data of the target genre/speaker.…”
Section: Automatic Audiobook Generation (mentioning)
confidence: 99%
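
The last step in the statement above, choosing averaged GSE vectors over the training data of a target genre or speaker, can be illustrated with a minimal sketch. This is not the cited authors' implementation: the function name `average_vectors_by_label`, the 2-dimensional toy vectors, and the genre labels are all hypothetical and serve only to show the per-label averaging idea.

```python
from collections import defaultdict


def average_vectors_by_label(vectors, labels):
    """Return {label: element-wise mean of all vectors sharing that label}."""
    sums = {}
    counts = defaultdict(int)
    for vec, label in zip(vectors, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
        # Accumulate the element-wise sum for this label.
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }


# Toy, made-up "style" vectors with hypothetical genre labels (assumption).
train_vectors = [[1.0, 2.0], [3.0, 4.0], [10.0, 0.0]]
train_genres = ["fiction", "fiction", "news"]

genre_means = average_vectors_by_label(train_vectors, train_genres)
print(genre_means["fiction"])
```

At synthesis time, a system of this kind would look up the mean vector for the predicted genre (or the desired speaker) rather than extracting one from reference audio.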
“…[18] leverages ToBI labels into neural TTS to improve the prosody. Recent works [19,20] attempt to sample from the learned prosodic distribution using contextual information. However, sentence-level prosody related to the speaker's intention draws much less attention.…”
Section: Introduction (mentioning)
confidence: 99%
“…However, various methods [7,8,9,10] based on sampling or reference audios do not consider textual semantic information, nor does the pre-trained prosody encoder [17]. Recent attempts on injecting various linguistic features [13,14,15,16] and context-based prosody sampling [19,20] can compensate for the information loss, but they do not provide discriminative information about sentence types. Therefore, the performance on rising intonation featured by declarative questions is not well studied yet.…”
Section: Introduction (mentioning)
confidence: 99%
“…Context-related works. There has been a wide range of research focused on learning or extracting contextual information to improve the performance of TTS. Multiple studies used textual context information extracted from text to improve the sentence prosody [37,38] or cross-sentence prosody [39] for sentence-based speech synthesis, or capture conversation information for conversational speech synthesis [40]. Specifically, the textual context information can be semantics-related features extracted by pre-trained models [39,40], i.e., BERT [41] or syntax-related features represented by parse trees [38,42] or statistics [37,40].…”
Section: Introduction (mentioning)
confidence: 99%