Speech Prosody 2016
DOI: 10.21437/speechprosody.2016-162
Data selection for naturalness in HMM-based speech synthesis

Abstract: Can we identify which utterances in a corpus are the best to use for voice training, based on acoustic/prosodic features, and which utterances should be excluded because they will introduce noise, artifacts, or inconsistency into the voice? Can we use found data such as radio broadcast news to build HMM-based synthesized voices? Can we select a subset of training utterances from a corpus of found data to produce a better voice than one trained on all of the data? Which voice training and modeling approaches wo…
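The questions in the abstract concern utterance-level data selection: score each utterance with acoustic/prosodic features and exclude outliers before training the voice. Below is a minimal sketch of that idea; the feature names, the z-score rule, and the threshold are illustrative assumptions, not the selection criteria actually used in the paper.

```python
# Minimal sketch of utterance-level data selection for voice training.
# The feature names (mean_f0, energy_std, articulation_rate), the z-score
# outlier rule, and the threshold are illustrative assumptions only.
from dataclasses import dataclass
from statistics import mean, stdev


@dataclass
class Utterance:
    utt_id: str
    mean_f0: float            # mean fundamental frequency (Hz)
    energy_std: float         # standard deviation of frame energy
    articulation_rate: float  # e.g., phones per second


def zscore(value: float, mu: float, sigma: float) -> float:
    return 0.0 if sigma == 0 else (value - mu) / sigma


def select_training_subset(corpus: list[Utterance],
                           max_abs_z: float = 2.0) -> list[Utterance]:
    """Keep utterances whose prosodic features are not outliers.

    Utterances whose z-score exceeds max_abs_z on any feature are
    excluded, on the assumption that they would introduce noise or
    inconsistency into the trained voice.
    """
    features = ("mean_f0", "energy_std", "articulation_rate")
    stats = {}
    for name in features:
        values = [getattr(u, name) for u in corpus]
        stats[name] = (mean(values), stdev(values))
    return [u for u in corpus
            if all(abs(zscore(getattr(u, name), *stats[name])) <= max_abs_z
                   for name in features)]
```

The same structure extends to directional filters (e.g., keeping only utterances below some articulation threshold) by replacing the symmetric z-score test with one-sided comparisons.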

Cited by 12 publications (20 citation statements). References 11 publications.
“…So we posted tiebreaker HITs on MTurk. The tie was not resolved for articulation, so we picked the low setting, corresponding with our prior findings [8,9] that training on hypo-articulated utterances tends to produce better voices. For standard deviation of energy, the low setting was slightly preferred.…”
Section: Contextual Feature Labeled Voices
confidence: 99%
“…We are specifically interested in radio broadcast news not only because it contains large amounts of speech from each anchor and is often professionally recorded, but because it is available in many languages. In our previous work [8,9], we explored data selection and outlier removal at the utterance level to produce voices that are rated as more natural, even though they were trained on a smaller amount of data than a baseline trained on all of the data. We selected our subsets based on a number of different acoustic and prosodic features, finding that removing outliers for hyper-articulation and combining filters for hypo-articulation and low mean f0 produced voices rated as significantly more natural.…”
Section: Introduction
confidence: 99%
“…Other studies using CS for collecting ratings on a Likert scale concentrate on audio quality assessments [16,7] (employing the discrete 5-point absolute-category-rating (ACR) scale for subjective Mean Opinion Scores), on voice naturalness [17], and on perceived Quality of Experience (QoE) in a teleconference system [18]. In contrast to a Likert scale, the approach in [19] was to ask CS participants to rate realizations of the /r/ sound in words as correct or incorrect.…”
Section: Previous Work
confidence: 99%
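The 5-point ACR procedure mentioned in this excerpt reduces to averaging discrete ratings per stimulus to obtain a Mean Opinion Score (MOS). A minimal sketch, assuming ratings arrive as lists of integers from 1 to 5 per stimulus (real crowdsourced studies additionally screen raters and control presentation order):

```python
# Minimal sketch of computing Mean Opinion Scores (MOS) from 5-point
# absolute-category-rating (ACR) judgments. The input layout
# (stimulus_id -> list of integer ratings 1..5) is an assumption
# made for illustration.
from math import sqrt
from statistics import mean, stdev


def mos_with_ci(ratings: dict[str, list[int]],
                z: float = 1.96) -> dict[str, tuple[float, float]]:
    """Return {stimulus_id: (MOS, half-width of an approx. 95% CI)}."""
    results = {}
    for stim, scores in ratings.items():
        m = mean(scores)
        half = z * stdev(scores) / sqrt(len(scores)) if len(scores) > 1 else 0.0
        results[stim] = (m, half)
    return results


if __name__ == "__main__":
    demo = {"utt_a": [4, 5, 4, 3, 4], "utt_b": [2, 3, 2, 2, 3]}
    for stim, (m, half) in mos_with_ci(demo).items():
        print(f"{stim}: MOS = {m:.2f} +/- {half:.2f}")
```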
“…Paired-comparison approaches have been widely adopted in CS for quality assessment of image [21], video [22], audio [23] and synthetic speech [24,17]. The study in [23] introduces a paired-comparison framework for quantifying QoE of multimedia content as a more convenient approach compared to 5-point scale ratings.…”
Section: Previous Work
confidence: 99%
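Paired-comparison tests like those in this excerpt yield win/loss counts per pair of conditions; one standard way to turn those counts into per-condition scores is a Bradley-Terry fit. The sketch below uses the simple iterative (MM) update and made-up demo counts; it is an assumption for illustration, not the analysis used in [23] or [24].

```python
# Minimal sketch: aggregate paired-comparison ("A vs. B") judgments into
# per-condition scores with a Bradley-Terry model fit by the MM update.
# The data format and demo counts are illustrative assumptions.
from collections import defaultdict


def bradley_terry(wins: dict[tuple[str, str], int],
                  iters: int = 100) -> dict[str, float]:
    """wins[(a, b)] = number of trials in which a was preferred over b."""
    items = sorted({x for pair in wins for x in pair})
    total_wins = defaultdict(float)    # W_i: total wins of item i
    comparisons = defaultdict(float)   # n_ij: total comparisons of i and j
    for (a, b), w in wins.items():
        total_wins[a] += w
        comparisons[(a, b)] += w
        comparisons[(b, a)] += w
    p = {i: 1.0 for i in items}
    for _ in range(iters):
        new_p = {}
        for i in items:
            denom = sum(comparisons[(i, j)] / (p[i] + p[j])
                        for j in items if j != i and comparisons[(i, j)] > 0)
            new_p[i] = total_wins[i] / denom if denom > 0 else p[i]
        norm = sum(new_p.values())
        p = {i: v / norm for i, v in new_p.items()}
    return p


if __name__ == "__main__":
    # e.g. "voice_a beat voice_b in 14 of 20 trials" -> two directed entries
    demo = {("voice_a", "voice_b"): 14, ("voice_b", "voice_a"): 6,
            ("voice_a", "voice_c"): 11, ("voice_c", "voice_a"): 9,
            ("voice_b", "voice_c"): 8,  ("voice_c", "voice_b"): 12}
    for voice, score in sorted(bradley_terry(demo).items(),
                               key=lambda kv: -kv[1]):
        print(f"{voice}: {score:.3f}")
```

A simple win-rate per condition often suffices for small listening tests; the Bradley-Terry scores become useful when not every pair of conditions is compared equally often.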