Many theories of spoken word recognition assume speech is segmented at syllable or word boundaries prior to contact with the lexicon. A number of researchers [see K. W. Church, Cognition 25, 53–69 (1987)] have observed systematic acoustic differences (or allophonic variations) within a phonetic class that may serve as cues to juncture points between syllables or words. The most commonly cited difference between syllable-initial and syllable-final /l/ is the proximity of the first two formants. Some researchers have also noted a difference in the amplitude profile of /l/ depending on its position relative to a word boundary. In this experiment, two starting phrases, “see leaves” and “seal eaves,” were synthesized. The F1 frequency profile, F2 frequency profile, and amplitude profile were varied orthogonally for both starting phrases. The junctural position of /l/ was cued primarily by the F1 frequency profile. The F2 frequency profile (and thus the proximity of the first two formants) and the amplitude profile seemed to play only a very small role in the perception of /l/ position. [Work supported by NIDCD Grant No. DC 00219 to SUNY at Buffalo.]
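As an illustration of what varying the three cues orthogonally for both starting phrases implies, the following is a minimal sketch of a full factorial crossing. The abstract does not state how many levels each cue had; the two levels per cue used here ("initial-like" and "final-like"), along with all variable and function names, are assumptions made purely for illustration.

```python
from itertools import product

# Starting phrases named in the abstract.
PHRASES = ["see leaves", "seal eaves"]
# The three cues named in the abstract.
CUES = ["f1_profile", "f2_profile", "amplitude_profile"]
# ASSUMPTION: two levels per cue; the abstract does not give the level count.
CUE_LEVELS = ["initial-like", "final-like"]


def build_design(phrases=PHRASES, cues=CUES, levels=CUE_LEVELS):
    """Return every phrase x cue-level combination (full factorial crossing)."""
    design = []
    for phrase in phrases:
        # product(..., repeat=n) yields every assignment of a level to each cue.
        for combo in product(levels, repeat=len(cues)):
            row = {"phrase": phrase}
            row.update(dict(zip(cues, combo)))
            design.append(row)
    return design


design = build_design()
for row in design:
    print(row)
```

Under these assumed levels, each cue value occurs equally often with every value of the other cues, which is what allows the contribution of each cue to be assessed independently of the others.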
180 undergraduates rated level of aspiration and likelihood of success for male or female targets of high, low, or unknown physical attractiveness possessing masculine, feminine, or androgynous gender characteristics for occupations varying in prestige and gender orientation. Perceived level of aspiration and likelihood of success were influenced by sex of target only for female-oriented occupations. Physical attractiveness increased the perceived likelihood of success in high-prestige male-oriented and neutral occupations. Gender characteristics influenced perceived level of aspiration for all high-prestige occupations but for only one low-prestige occupation. Results are discussed relative to changing stereotypes in today's society.
A TTS voice quality experiment was conducted to select a speaker and to evaluate synthesis techniques. Small-scale TTS diphone inventories were recorded from six professional female speakers who had been pre-selected in an audition. Two types of inventories were recorded for each speaker: a series of nonsense words and a series of English sentences. Using these 12 inventories, two synthesis methods were compared: PSOLA [Charpentier and Moulines, Eurospeech ’89] and Harmonic Plus Noise (HNM) [Stylianou et al., Eurospeech ’97]. Synthetic prosody closely modeled naturally spoken versions of the target utterances. Three fully synthetic (TTS) and two hybrid (i.e., partly recorded from the human speaker and partly synthesized) sentences formed the experimental stimuli for subjective testing. As references, two MNRU versions of the naturally spoken sentences were used: (a) Q10 (resembling low-end commercial 16-kbps encoded speech) and (b) Q35 (resembling high-quality telephone speech). Forty-one subjects rated intelligibility [I], naturalness [N], and pleasantness [P] on five-point MOS scales. A total of 936 ratings were collected from each subject. Repeated-measures analyses of variance (ANOVAs) were performed on the data. There were significant main effects of speaker, synthesis method, and inventory, plus interactions. It was found that (1) the best speaker consistently outperformed the others on all three rating scales; for the optimal combination of parameters, TTS ratings ranged (across speakers) as follows: [I] 3.64–2.94, [N] 3.36–2.7, [P] 3.34–2.53; (2) HNM outperformed PSOLA (consistently 0.25 points higher for [I], [N], [P] scores); and (3) the diphone inventory extracted from sentences was preferred over that extracted from nonsense words (with a significantly smaller difference of 0.10 for HNM than 0.19 for PSOLA).
Many theories of speech perception rely on the loci of spectral peaks as at least one factor upon which pattern recognition is based. However, when a peak is lower in amplitude than its neighbors, it may not be used in phonetic recognition. In the first experiment, a [u]-[i] series was constructed by manipulating the amplitude of a spectral peak (851 Hz). Subjects readily identified an item from the series with a low-amplitude, 851-Hz spectral peak as an [i]. It would appear that this peak, at a low amplitude, is not used at a phonetic level of processing. Further experiments test the perceptual locus of the use (or nonuse) of this low-amplitude peak information. Selective adaptation experiments were run in which the adaptors, including the [i] from the first experiment, varied in spectral overlap with a [u]-[u] test series in order to determine the degree to which the low-amplitude, 851-Hz peak is utilized in processing. The results will be discussed in terms of how peaks are analyzed at different levels of processing and how this relates to various theories of speech perception. [Work supported by NIDCD DC00219.]