10th ISCA Workshop on Speech Synthesis (SSW 10) 2019
DOI: 10.21437/ssw.2019-18

Evaluating Long-form Text-to-Speech: Comparing the Ratings of Sentences and Paragraphs

Abstract: Text-to-speech systems are typically evaluated on single sentences. When long-form content, such as data consisting of full paragraphs or dialogues, is considered, evaluating sentences in isolation is not always appropriate, as the context in which the sentences are synthesized is missing. In this paper, we investigate three different ways of evaluating the naturalness of long-form text-to-speech synthesis. We compare the results obtained from evaluating sentences in isolation, evaluating whole paragraphs of spee…

Cited by 19 publications (17 citation statements)
References 7 publications
“…Since we provide isolated sentences to listeners, there is no concept of correct prosody [5], as such we do not provide a visible reference and do not require listeners to rate two systems as 0 and 100. 25 listeners completed this test.…”
Section: Prosody Prediction
Mentioning confidence: 99%
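The reference-free listening test described in the statement above still yields per-system ratings on a 0–100 scale, which are then aggregated across listeners. A minimal sketch of that aggregation step, using entirely hypothetical ratings (the function name, data, and normal-approximation confidence interval are illustrative assumptions, not from the cited paper):

```python
import math
import statistics

def summarize_ratings(ratings):
    """Mean and 95% confidence interval (normal approximation)
    for a list of 0-100 listener ratings of one system."""
    mean = statistics.mean(ratings)
    sem = statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, (mean - 1.96 * sem, mean + 1.96 * sem)

# Hypothetical ratings from 25 listeners for two TTS systems.
system_a = [72, 68, 75, 80, 64, 70, 77, 69, 73, 71,
            66, 74, 79, 70, 68, 72, 76, 65, 71, 74,
            69, 73, 70, 75, 67]
system_b = [58, 62, 55, 60, 64, 57, 61, 59, 63, 56,
            60, 58, 62, 57, 61, 59, 63, 55, 60, 58,
            62, 57, 61, 59, 60]

for name, scores in [("A", system_a), ("B", system_b)]:
    mean, (lo_ci, hi_ci) = summarize_ratings(scores)
    print(f"System {name}: mean {mean:.1f}, "
          f"95% CI [{lo_ci:.1f}, {hi_ci:.1f}]")
```

Because no hidden reference anchors the scale, only relative differences between systems (e.g., non-overlapping intervals) are meaningful, not the absolute values.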
“…Unrealistic prosody modelling is often linked to a lack of contextual information. Indeed, without sufficient context, predicting prosody is an ill-posed problem [5], as any number of prosodies could be deemed appropriate for a given text. We believe that this is the biggest limitation of current state-of-the-art TTS models.…”
Section: Introduction
Mentioning confidence: 99%
“…Despite the interest in TTS that speaks longer chunks of text, e.g., [16], no major engine currently assumes a connection between consecutive audio files in the corpus. Instead, utterances are traditionally treated in isolation.…”
Section: Bigram Corpus Methods
Mentioning confidence: 99%
“…The main reasons for retaining this custom is that 1) most conventional TTS training material is comprised of sentences read in isolation, and 2) loading longer speech segments (e.g., audiobook paragraphs instead of sentences) into a conventional TTS engine would quickly result in the system running out of memory. This state of affairs is unfortunate in the case of corpora created from longer continuous recordings of speech, such as audiobooks [16,17]. It is even more problematic for spontaneous speech, where syntactic forms do not follow the conventions of written language and cannot be easily separated into standalone, coherent semantic units.…”
Section: Bigram Corpus Methods
Mentioning confidence: 99%
“…Another, somewhat complementary reason arises from the lack of explicit control inherent to the "black-box" machine learning architectures, such as s2s systems. On the one hand, the existing systems are not designed to capture the long-range semantic dependencies [7], on the other hand, they do not facilitate explicit control of prosody akin to older parametric synthesis approaches, where linguistic and prosodic labels were utilized and prosodic parameters were modelled separately [8].…”
Section: Introduction
Mentioning confidence: 99%