One goal of speech synthesis-by-rule is to replicate the idiolectal properties of human speakers in order to produce more natural and distinctive synthetic voices. The spectral properties of vowels can be expected to vary across idiolects in formant frequencies, timing of transitions, bandwidths, and source spectra. These characteristics can be combined into a model that specifies each vowel in terms of formant targets for its nucleus and offglide, the percentage of the vowel's duration at which the midpoint of the transition falls, and the duration of the transition [J. Allen, From Text to Speech: The MITalk System (1986)]. These parameters were measured for diphthongs and [Vr] sequences in stressed monosyllables embedded in a frame sentence. Data from an AX discrimination test will be presented that indicate the relative contribution of each model parameter to simulating the natural vowels.
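As a rough illustration of how the four model parameters could define a formant contour, the following sketch builds a single formant track from a nucleus target, an offglide target, the transition midpoint (as a percentage of vowel duration), and the transition duration. The function name `formant_track`, the linear shape of the transition, the 10 ms step, and the example frequencies are all assumptions made for illustration; the abstract does not state how the transition is interpolated, and none of the numbers are measurements from the study.

```python
def formant_track(nucleus_hz, offglide_hz, vowel_ms, midpoint_pct,
                  transition_ms, step_ms=10):
    """Return [(time_ms, freq_hz), ...] for one formant of one vowel.

    Assumed shape (not from the source): steady nucleus target, then a
    linear transition centred on the midpoint, then steady offglide target.
    """
    # Time of the transition midpoint, from its percentage of vowel duration.
    mid_ms = vowel_ms * midpoint_pct / 100.0
    # Transition boundaries, clipped to the vowel's extent.
    start_ms = max(0.0, mid_ms - transition_ms / 2.0)
    end_ms = min(float(vowel_ms), mid_ms + transition_ms / 2.0)

    track = []
    t = 0
    while t <= vowel_ms:
        if t <= start_ms:
            freq = nucleus_hz
        elif t >= end_ms:
            freq = offglide_hz
        else:
            # Linear interpolation between the two targets.
            frac = (t - start_ms) / (end_ms - start_ms)
            freq = nucleus_hz + frac * (offglide_hz - nucleus_hz)
        track.append((t, freq))
        t += step_ms
    return track


# Hypothetical second-formant values for a diphthong, for illustration only.
f2 = formant_track(nucleus_hz=1200, offglide_hz=2000, vowel_ms=200,
                   midpoint_pct=50, transition_ms=100)
```

Varying one argument at a time while holding the others fixed mirrors, in a very simplified way, the idea of testing the contribution of each model parameter separately.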
Intelligibility tests of initial and final English consonants were conducted over a simulated long-distance telephone line for two leading text-to-speech converters descended from MITalk. CVC stimuli were presented to subjects in open-response listening tests to determine how vulnerable the intelligibility of synthetic speech is to telephone bandwidth limitations. Subjects were also tested under nontelephone conditions for comparison. It was hypothesized that telephone bandpassing would produce a systematic breakdown in consonant intelligibility, and that certain phonemes, e.g., alveolar fricatives and alveolar stops before front vowels, would suffer the greatest loss because they rely on cues normally present at the higher frequencies. Our results show that the overall intelligibility of high-quality synthetic speech is significantly reduced over the telephone. As expected, alveolar fricatives produced a large number of errors. Alveolar stops, on the other hand, remained robust and generated few perceptual errors. Velar stops, contrary to expectation, produced a large number of place and manner confusions. Intelligibility scores for initial and final consonants will be presented, grouped by manner class.
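Scoring open-response consonant identifications by manner class can be sketched as follows. This is a minimal illustration, not the authors' scoring procedure: the `MANNER` table covers only a handful of consonants, the function name `score_by_manner` is hypothetical, and the `trials` list is made-up placeholder input rather than data from the study.

```python
from collections import defaultdict

# Partial, illustrative mapping from consonant to manner class.
MANNER = {
    "p": "stop", "t": "stop", "k": "stop",
    "b": "stop", "d": "stop", "g": "stop",
    "f": "fricative", "v": "fricative",
    "s": "fricative", "z": "fricative",
    "m": "nasal", "n": "nasal",
}


def score_by_manner(trials):
    """Return the proportion correct per manner class.

    trials: iterable of (target, response) consonant pairs, where the
    response is whatever the listener wrote down for that position.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for target, response in trials:
        manner = MANNER[target]
        totals[manner] += 1
        if response == target:
            correct[manner] += 1
    return {manner: correct[manner] / totals[manner] for manner in totals}


# Placeholder trials, invented purely to exercise the function.
trials = [("s", "f"), ("s", "s"), ("t", "t"), ("k", "t"), ("n", "n")]
scores = score_by_manner(trials)
```

Running the same function separately on initial-consonant and final-consonant trials, and on telephone and nontelephone conditions, would give the kind of grouped comparison the abstract describes.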