In a spoken utterance, a talker expresses linguistic constituents in serial order. A listener resolves these linguistic properties in the rapidly fading auditory sample. Classic measures agree that auditory integration occurs at a fine temporal grain. In contrast, recent studies have proposed that sensory integration of speech occurs at a coarser grain, approximating the syllable, based on indirect and relatively insensitive perceptual measures. Evidence from cognitive neuroscience and behavioral primatology has also been adduced to support the claim of sensory integration at the pace of syllables. The present investigation uses direct performance measures of integration, applying an acoustic technique to isolate the contribution of short-term acoustic properties to assays of modulation sensitivity. In corroborating the classic finding of a fine temporal grain of integration, these functional measures can inform theory and speculation in accounts of speech perception.
Speech remains intelligible despite the elimination of canonical acoustic correlates of phonemes from the spectrum. A portion of this perceptual flexibility can be attributed to modulation sensitivity in the auditory-to-phonetic projection, though signal-independent properties of lexical neighborhoods also affect intelligibility in utterances composed of words. Three tests were conducted to estimate the effects of exposure to natural and sine-wave samples of speech on this kind of perceptual versatility. First, sine-wave versions of the easy/hard word sets were created, modeled on the speech samples of a single talker. The performance difference in recognition of easy and hard words was used to index the perceptual reliance on signal-independent properties of lexical contrasts. Second, several kinds of exposure produced familiarity with an aspect of sine-wave speech: 1) sine-wave sentences modeled on the same talker; 2) sine-wave sentences modeled on a different talker, to create familiarity with a sine-wave carrier; and 3) natural sentences spoken by the same talker, to create familiarity with the idiolect expressed in the sine-wave words. Recognition performance with both easy and hard sine-wave words improved only after exposure to sine-wave sentences modeled on the same talker. Third, a control test showed that signal-independent uncertainty is a plausible cause of differences in recognition of easy and hard sine-wave words. The conditions of beneficial exposure reveal the specificity of attention underlying versatility in speech perception.
Linear prediction is a widely available technique for analyzing acoustic properties of speech, although this method is known to be error-prone. New tests assessed the adequacy of linear prediction estimates by using this method to derive synthesis parameters and testing the intelligibility of the synthetic speech that results. Matched sets of sine-wave sentences were created, one set using uncorrected linear prediction estimates of natural sentences, the other using estimates made by hand. Phonemic restrictions imposed on the sentences allowed comparisons among continuous and intermittent voicing, oral or nasal versus fricative manner, and unrestricted phonemic variation. Intelligibility tests revealed uniformly good performance with sentences created by hand estimation and a minimal decrease in intelligibility with estimation by linear prediction due to manner variation with continuous voicing. Poorer performance was observed when linear prediction estimates were used to produce synthetic versions of phonemically unrestricted sentences, but no similar decline was observed with synthetic sentences produced by hand estimation. The results show a substantial intelligibility cost of reliance on uncorrected linear prediction estimates when phonemic variation approaches natural incidence.
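The linear prediction analysis discussed above can be illustrated with a minimal sketch of the autocorrelation method (Levinson-Durbin recursion), the standard way such coefficients are estimated. This is an illustrative example only, not the analysis pipeline used in the study; the function names and the root-angle formant heuristic are the author's own simplifications, and real formant tracking requires pre-emphasis, windowing, and bandwidth-based pruning of candidate roots.

```python
import numpy as np

def lpc_coefficients(frame, order):
    """Autocorrelation-method LPC via the Levinson-Durbin recursion.
    Returns [1, a1, ..., a_order] for A(z) = 1 + a1*z^-1 + ... """
    n = len(frame)
    # Autocorrelation lags 0..order of the frame
    r = np.correlate(frame, frame, mode="full")[n - 1:n + order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient for this recursion step
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a

def formants_from_lpc(a, sample_rate):
    """Candidate formant frequencies (Hz) from LPC polynomial roots:
    one root per complex-conjugate pair, converted from pole angle."""
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 1e-9]
    freqs = np.angle(roots) * sample_rate / (2.0 * np.pi)
    return np.sort(freqs)
```

For example, an order-2 analysis of a pure 440-Hz sinusoid sampled at 8 kHz yields a single conjugate pole pair whose angle corresponds to a frequency near 440 Hz, which is the sense in which LPC poles track spectral prominences such as formants.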
Speech signal components that are desynchronized from the veridical temporal pattern lose intelligibility. In contrast, audiovisual presentations with large desynchrony in visible and audible speech streams are perceived without loss of integration. Under such conditions, the limit of desynchrony that permits audiovisual integration is also adaptable. A new project directly investigated the potential for adaptation to consistent desynchrony with unimodal auditory sine-wave speech. Listeners transcribed sine-wave sentences that are highly intelligible when their temporal properties are veridical. Desynchronized variants were created by leading or lagging the tone analog of the second formant relative to the rest of the tones composing the sentences, in 50-msec steps, ranging from 250-msec lead to 250-msec lag. In blocked trials, listeners tolerated only desynchronies of less than 50 msec, and exhibited no gain in intelligibility from consistent desynchrony. Unimodal auditory and bimodal audiovisual forms of perceptual integration evidently exhibit different temporal characteristics, an indication of distinct perceptual functions.
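The desynchrony manipulation described above can be sketched in a few lines: mix several tone channels after shifting one of them by a signed lead or lag. This is a hypothetical illustration under the author's own simplifying assumptions (the function name and channel layout are invented, and the actual stimuli were tone analogs of formant tracks, not the steady sinusoids a reader might substitute here); it shows only the timing arithmetic.

```python
import numpy as np

def desynchronize(tones, shift_idx, lag_ms, sample_rate):
    """Mix equal-length tone channels, with channel `shift_idx`
    delayed (positive lag_ms) or advanced (negative lag_ms)
    relative to the other channels. Output is zero-padded so
    every sample of every channel is preserved."""
    n_shift = int(round(abs(lag_ms) * sample_rate / 1000.0))
    mix = np.zeros(len(tones[0]) + n_shift)
    for i, tone in enumerate(tones):
        if i == shift_idx and lag_ms > 0:
            offset = n_shift      # lagged channel starts later
        elif i != shift_idx and lag_ms < 0:
            offset = n_shift      # leading channel starts earlier
        else:
            offset = 0
        mix[offset:offset + len(tone)] += tone
    return mix
```

With a 50-msec lag at an 8-kHz sampling rate the shifted channel is offset by 400 samples, and a zero lag reduces to a plain sum of the channels, which is the veridical-timing baseline condition.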