When members of a series of synthesized stop consonants varying acoustically in F3 characteristics and varying perceptually from /da/ to /ga/ are preceded by /al/, subjects report hearing more /ga/ syllables relative to when each member is preceded by /ar/ (Mann, 1980). It has been suggested that this result demonstrates the existence of a mechanism that compensates for coarticulation via tacit knowledge of articulatory dynamics and constraints, or through perceptual recovery of vocal-tract dynamics. The present study was designed to assess the degree to which these perceptual effects are specific to qualities of human articulatory sources. In three experiments, series of consonant-vowel (CV) stimuli varying in F3-onset frequency (/da/-/ga/) were preceded by speech versions or nonspeech analogues of /al/ and /ar/. The effect of liquid identity on stop consonant labeling remained when the preceding VC was produced by a female speaker and the CV syllable was modeled after a male speaker's productions. Labeling boundaries also shifted when the CV was preceded by a sine-wave glide modeled after F3 characteristics of /al/ and /ar/. Identifications shifted even when the preceding sine wave was of constant frequency equal to the offset frequency of F3 from a natural production. These results suggest an explanation in terms of general auditory processes as opposed to recovery of, or knowledge of, specific articulatory dynamics.

Despite 40 years of sustained effort to develop machine speech-recognition devices, no engineering approach to speech perception has achieved the success of an average 2-year-old human. One of the more daunting aspects of speech for these efforts is the acoustic effects of coarticulation. Traditionally, coarticulation refers to the spatial and temporal overlap of adjacent articulatory activities. This is reflected in the acoustic signal by severe context dependence; acoustic information specifying one phoneme varies substantially depending on surrounding phonemes. As a result, there is a lack of invariance between linguistic units (e.g., phonemes, morphemes) and attributes of the acoustic signal. This poses quite a problem for speech-recognition devices that are designed to output strings of phonemes.

An example of coarticulatory influence is the effect of a preceding liquid on the acoustic realization of a subsequent stop consonant. Mann (1980) reports that articulation of the syllables /da/ and /ga/ may be influenced by the production of a preceding /al/ or /ar/. Articulatorily described, the physical realizations of the phonemes /d/ and /g/ differ primarily in the place at which the tongue occludes the vocal tract. For a velar stop [g], the tongue body is raised against the soft palate at the rear of the mouth, whereas for an alveolar stop [d], the tongue tip comes in contact with the alveolar ridge toward the front of the oral cavity behind the teeth. The liquids /l/ and /r/ differ in a similar manner; an [r] is produced with the tongue raised toward the rear of the cavity, and an [l] is produce...
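As a rough illustration of how nonspeech precursors of this kind can be constructed, the Python sketch below generates a sine-wave glide and a constant-frequency tone of the type described above. The sampling rate, durations, and frequency values are illustrative assumptions, not the published stimulus parameters.

```python
import numpy as np

FS = 10_000  # sampling rate in Hz; an assumption, not the published value

def sine_glide(f_start, f_end, dur_s, fs=FS):
    """Sine wave whose frequency moves linearly from f_start to f_end.
    Phase is accumulated as the running integral of instantaneous frequency."""
    n = int(dur_s * fs)
    f_inst = np.linspace(f_start, f_end, n)     # instantaneous frequency (Hz)
    phase = 2 * np.pi * np.cumsum(f_inst) / fs  # integrate frequency -> phase
    return np.sin(phase)

def steady_tone(freq, dur_s, fs=FS):
    """Constant-frequency tone, e.g., fixed at an F3 offset frequency."""
    t = np.arange(int(dur_s * fs)) / fs
    return np.sin(2 * np.pi * freq * t)

# Hypothetical F3 trajectories: /al/ ends with a high F3, /ar/ with a low one.
al_glide = sine_glide(f_start=2200, f_end=2700, dur_s=0.075)
ar_glide = sine_glide(f_start=2200, f_end=1700, dur_s=0.075)

# Constant-frequency versions fixed at the (assumed) F3 offset frequencies.
al_tone = steady_tone(2700, 0.075)
ar_tone = steady_tone(1700, 0.075)

# A precursor would then be prepended to each member of the /da/-/ga/ series:
# stimulus = np.concatenate([ar_glide, silence_gap, cv_syllable])
```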
When members of a series of synthesized stop consonants varying in third-formant (F3) characteristics and varying perceptually from /da/ to /ga/ are preceded by /al/, human listeners report hearing more /ga/ syllables than when the members of the series are preceded by /ar/. It has been suggested that this shift in identification is the result of specialized processes that compensate for acoustic consequences of coarticulation. To test the species-specificity of this perceptual phenomenon, data were collected from nonhuman animals in a syllable "labeling" task. Four Japanese quail (Coturnix coturnix japonica) were trained to peck a key differentially to identify clear /da/ and /ga/ exemplars. After training, ambiguous members of a /da/-/ga/ series were presented in the context of /al/ and /ar/ syllables. Pecking performance demonstrated a shift which coincided with data from humans. These results suggest that processes underlying "perceptual compensation for coarticulation" are species-general. In addition, the pattern of response behavior expressed is rather common across perceptual systems.
Japanese quail (Coturnix coturnix) learned a category for syllable-initial [d] followed by a dozen different vowels. After learning to categorize syllables consisting of [d], [b], or [g] followed by four different vowels, quail correctly categorized syllables in which the same consonants preceded eight novel vowels. Acoustic analysis of the categorized syllables revealed no single feature or pattern of features that could support generalization, suggesting that the quail adopted a more complex mapping of stimuli into categories. These results challenge theories of speech sound classification that posit uniquely human capacities.
Four experiments explored the relative contributions of spectral content and phonetic labeling in effects of context on vowel perception. Two 10-step series of CVC syllables ([bVb] and [dVd]) varying acoustically in F2 midpoint frequency and varying perceptually in vowel height from [ʌ] to [ɛ] were synthesized. In a forced-choice identification task, listeners more often labeled vowels as [ʌ] in [dVd] context than in [bVb] context. To examine whether spectral content predicts this effect, nonspeech-speech hybrid series were created by appending 70-ms sine-wave glides following the trajectory of CVC F2's to 60-ms members of a steady-state vowel series varying in F2 frequency. In addition, a second hybrid series was created by appending constant-frequency sine-wave tones equivalent in frequency to CVC F2 onset/offset frequencies. Vowels flanked by frequency-modulated glides or steady-state tones modeling [dVd] were more often labeled as [ʌ] than were the same vowels surrounded by nonspeech modeling [bVb]. These results suggest that spectral content is important in understanding vowel context effects. A final experiment tested whether spectral content can modulate vowel perception when phonetic labeling remains intact. Voiceless consonants, with lower-amplitude, more-diffuse spectra, were found to exert less of an influence on vowel perception than do their voiced counterparts. The data are discussed in terms of a general perceptual account of context effects in speech perception.
Natural sounds are complex, typically changing along multiple acoustic dimensions that covary in accord with physical laws governing sound-producing sources. We report that, after passive exposure to novel complex sounds, highly correlated features initially collapse onto a single perceptual dimension, capturing covariance at the expense of unitary stimulus dimensions. Discriminability of sounds respecting the correlation is maintained, but is temporarily lost for sounds orthogonal or oblique to experienced covariation. Following extended experience, perception of variance not captured by the correlation is restored, but weighted only in proportion to total experienced covariance. A Hebbian neural network model captures some aspects of listener performance; an anti-Hebbian model captures none; but a principal components analysis model captures the full pattern of results. Predictions from the principal components analysis model also match evolving listener performance in two discrimination tasks absent passive listening. These demonstrations of adaptation to correlated attributes provide direct behavioral evidence for efficient coding.

Keywords: auditory perception | cortical models | perceptual organization
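For readers who want the gist of the modeling comparison, here is a minimal numerical sketch of the principal-components idea: after exposure to stimuli whose two attributes covary, distances are re-weighted by the variance experienced along each component, so stimulus pairs orthogonal to the correlation become nearly indiscriminable. The attribute values, sample size, and the weighted-distance rule are assumptions for illustration, not the fitted model from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical exposure set: two acoustic attributes lying near the main
# diagonal, i.e., lots of variance along the correlation, little across it.
n = 500
principal = rng.normal(0, 1.0, n)   # variance along the correlation
residual = rng.normal(0, 0.1, n)    # small variance orthogonal to it
X = np.column_stack([principal + residual, principal - residual])

# PCA via eigendecomposition of the covariance matrix of the exposure set.
cov = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]          # sort components by variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

def perceptual_distance(a, b):
    """Assumed readout: project the difference onto the components and
    weight each component in proportion to experienced variance."""
    d = (a - b) @ eigvecs                  # rotate into component space
    w = eigvals / eigvals.sum()            # variance-proportional weights
    return np.sqrt(np.sum(w * d ** 2))

# One pair respecting the correlation, one orthogonal to it.
consistent = perceptual_distance(np.array([1, 1]), np.array([-1, -1]))
orthogonal = perceptual_distance(np.array([1, -1]), np.array([-1, 1]))
print(consistent, orthogonal)  # the orthogonal pair is far less discriminable
```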
Perceptual systems in all modalities are predominantly sensitive to stimulus change, and many examples of perceptual systems responding to change can be portrayed as instances of enhancing contrast. Multiple findings from perception experiments serve as evidence for spectral contrast explaining fundamental aspects of perception of coarticulated speech, and these findings are consistent with a broad array of known psychoacoustic and neurophysiological phenomena. Beyond coarticulation, important characteristics of speech perception that extend across broader spectral and temporal ranges may best be accounted for by the constant calibration of perceptual systems to maximize sensitivity to change.

Sensorineural systems respond to change

It is both true and fortunate that sensorineural systems respond to change and to little else. Perceptual systems do not record absolute level, be it loudness, pitch, brightness, or color. This fact has been demonstrated in every sensory domain. Physiologically, sensory encoding is always relative. This sacrifice of absolute encoding has enormous benefits along the way to maximizing information transmission. Biological sensors have impressive dynamic range given their evolution via borrowed parts (e.g., gill arches becoming middle ear bones). However, biological dynamic range is always a small fraction of the physical range of absolute levels available in the environment, as well as of the perceptual range essential to organisms' survival. This is true whether one is considering optical luminance or acoustic pressure. The beauty of sensory systems is that, by responding to relative change, a limited dynamic range adjusts to maximize the amount of change that can be detected in the environment.

The simplest way that sensory systems adjust dynamic range to maximize sensitivity to change is via adaptation. Following nothing, a sensory stimulus triggers a strong sensation. However, when sustained sensory input does not change over time, constant stimulation loses impact. This sort of sensory attenuation due to adaptation is ubiquitous, and has been documented in vision (Riggs et al
Speech sounds are traditionally divided into consonants and vowels. When only vowels or only consonants are replaced by noise, listeners are more accurate at understanding sentences in which consonants are replaced but vowels remain. From such data, vowels have been suggested to be more important for understanding sentences; however, such conclusions are tempered by the fact that the replaced consonant segments were roughly one-third shorter than the vowels. We report two experiments demonstrating that listener performance is better predicted by simple psychoacoustic measures of cochlea-scaled spectral change across time. First, listeners identified sentences in which portions of consonants (C), vowels (V), CV transitions, or VC transitions were replaced by noise. Relative intelligibility was not well accounted for on the basis of Cs, Vs, or their transitions. In a second experiment, distinctions between Cs and Vs were abandoned. Instead, portions of sentences were replaced on the basis of cochlea-scaled spectral entropy (CSE). Sentence segments having relatively high, medium, or low entropy were replaced with noise. Intelligibility decreased linearly as the amount of replaced CSE increased. Duration of signal replaced and proportion of consonants/vowels replaced failed to account for listener data. CSE corresponds closely with the linguistic construct of sonority (or vowel-likeness), which is useful for describing phonological systematicity, especially syllable composition. Results challenge traditional distinctions between consonants and vowels. Speech intelligibility is better predicted by nonlinguistic sensory measures of uncertainty (potential information) than by orthodox physical acoustic measures or linguistic constructs.
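The sketch below illustrates one plausible reading of such a measure: slice the signal into brief frames, summarize each frame with a coarse spectrum on a cochlea-like (ERB) frequency scale, and score change as the Euclidean distance between successive frames. The slice duration, channel count, and the crude spectral pooling used in place of a true gammatone filter bank are all assumptions for illustration, not the procedure from the paper.

```python
import numpy as np

def erb_space(low, high, n):
    """Center frequencies equally spaced on the ERB-number scale
    (Glasberg & Moore formula)."""
    def hz_to_erb(f): return 21.4 * np.log10(4.37e-3 * f + 1.0)
    def erb_to_hz(e): return (10 ** (e / 21.4) - 1.0) / 4.37e-3
    return erb_to_hz(np.linspace(hz_to_erb(low), hz_to_erb(high), n))

def spectral_slices(x, fs, slice_ms=16, n_filters=33):
    """Coarse ERB-scaled log spectrum for each consecutive slice."""
    hop = int(fs * slice_ms / 1000)
    centers = erb_space(80, fs / 2 * 0.9, n_filters)
    slices = []
    for i in range(0, len(x) - hop, hop):
        seg = x[i:i + hop] * np.hanning(hop)
        spec = np.abs(np.fft.rfft(seg))
        freqs = np.fft.rfftfreq(hop, 1 / fs)
        # Crude stand-in for a filter bank: pool FFT magnitude near each
        # center frequency, with bandwidth growing with frequency.
        energies = [spec[np.abs(freqs - fc) < 0.1 * fc + 50].sum()
                    for fc in centers]
        slices.append(np.log(np.array(energies) + 1e-9))
    return np.array(slices)

def spectral_change(x, fs):
    """Euclidean distance between successive slices; high values mark the
    regions carrying the most potential information."""
    s = spectral_slices(x, fs)
    return np.linalg.norm(np.diff(s, axis=0), axis=1)
```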