Synchronous presentation of stimuli to the auditory and visual systems can modify the formation of a percept in either modality. For example, perception of auditory speech is improved when the speaker's facial articulatory movements are visible. Neural convergence onto multisensory sites exhibiting supra-additivity has been proposed as the principal mechanism for integration. Recent findings, however, have suggested that putative sensory-specific cortices are responsive to inputs presented through a different modality. Consequently, when and where audiovisual representations emerge remains unsettled. In combined psychophysical and electroencephalography experiments we show that visual speech speeds up the cortical processing of auditory signals early (within 100 ms of signal onset). The auditory-visual interaction is reflected as an articulator-specific temporal facilitation (as well as a nonspecific amplitude reduction). The latency facilitation systematically depends on the degree to which the visual signal predicts possible auditory targets. The observed auditory-visual data support the view that there exist abstract internal representations that constrain the analysis of subsequent speech inputs. This is evidence for the existence of an "analysis-by-synthesis" mechanism in auditory-visual speech perception.

… "combinations" such as "pk" or "kp" but never a fused percept. These results illustrate the effect of input modality on the perceptual AV speech outcome and suggest that multisensory percept formation is systematically based on the informational content of the inputs. In classic speech theories, however, visual speech has seldom been accounted for as a natural source of speech input. Ultimately, when in the processing stream (i.e., at which representational stage) sensory-specific information fuses to yield unified percepts is fundamental for any theoretical, computational, and neuroscientific account of speech perception.

Recent investigations of AV speech are based on hemodynamic studies that cannot speak directly to timing issues (2, 3). Electroencephalographic (EEG) and magnetoencephalographic (4-7) studies testing AV speech integration have typically used oddball or mismatch negativity paradigms; thus the earliest AV speech interactions have been reported for the 150- to 250-ms mismatch response. Whether systematic AV speech interactions can be documented earlier is controversial, although nonspeech effects can be observed early (8).

AV Speech as a Multisensory Problem

Several properties of speech are relevant to the present study. (i) Because AV speech is ecologically valid for humans (9, 10), one might predict an involvement of specialized neural computations capable of handling the spectrotemporal complexity of AV speech (compared to, say, arbitrary tone-flash pairings, for which no natural functional relevance can be assumed). (ii) Natural AV speech is characterized by particular dynamics, such as (a) the temporal precedence of visual speech (the movement of the facial articulators typically …
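The study above localizes the AV interaction as a latency facilitation of early auditory evoked responses (within roughly 100 ms of sound onset). Below is a minimal sketch of how such a facilitation could be quantified from trial-averaged ERPs; the simulated waveforms, the 50-150 ms window, and the function names are illustrative assumptions, not the authors' analysis pipeline.

```python
import numpy as np

def peak_latency_ms(erp, t_ms, window=(50, 150)):
    """Latency (ms) of the largest-magnitude deflection inside a time window.

    erp  : 1-D array, trial-averaged evoked response for one condition
    t_ms : 1-D array of time stamps in ms, aligned to auditory onset
    """
    mask = (t_ms >= window[0]) & (t_ms <= window[1])
    idx = np.argmax(np.abs(erp[mask]))
    return t_ms[mask][idx]

# Hypothetical averaged responses for auditory-only (A) and audiovisual (AV) trials.
t = np.arange(-100, 400)                    # 1-ms sampling, -100..399 ms
a_erp  = np.exp(-((t - 100) / 20.0) ** 2)   # simulated N1-like peak near 100 ms
av_erp = np.exp(-((t - 88) / 20.0) ** 2)    # same peak, arriving earlier with visual speech

facilitation = peak_latency_ms(a_erp, t) - peak_latency_ms(av_erp, t)
print(f"latency facilitation: {facilitation:.0f} ms")  # positive = AV earlier than A
```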
Classic accounts of the benefits of speechreading to speech recognition treat auditory and visual channels as independent sources of information that are integrated fairly early in the speech perception process. The primary question addressed in this study was whether visible movements of the speech articulators could be used to improve the detection of speech in noise, thus demonstrating an influence of speechreading on the ability to detect, rather than recognize, speech. In the first experiment, ten normal-hearing subjects detected the presence of three known spoken sentences in noise under three conditions: auditory-only (A), auditory plus speechreading with a visually matched sentence (AV(M)), and auditory plus speechreading with a visually unmatched sentence (AV(UM)). When the speechread sentence matched the target sentence, average detection thresholds improved by about 1.6 dB relative to the auditory condition. However, the amount of threshold reduction varied significantly for the three target sentences (from 0.8 to 2.2 dB). There was no difference in detection thresholds between the AV(UM) condition and the A condition. In a second experiment, the effect of visually matched orthographic stimuli on detection thresholds was examined for the same three target sentences in six subjects who participated in the earlier experiment. When the orthographic stimuli were presented just prior to each trial, average detection thresholds improved by about 0.5 dB relative to the A condition. However, unlike the AV(M) condition, the detection improvement due to orthography was not dependent on the target sentence. Analyses of correlations between area of mouth opening and acoustic envelopes derived from selected spectral regions of each sentence (corresponding to the wide-band speech, and first, second, and third formant regions) suggested that AV(M) threshold reduction may be determined by the degree of auditory-visual temporal coherence, especially between the area of lip opening and the envelope derived from mid- to high-frequency acoustic energy. Taken together, the data (for these sentences at least) suggest that visual cues derived from the dynamic movements of the face during speech production interact with time-aligned auditory cues to enhance sensitivity in auditory detection. The amount of visual influence depends in part on the degree of correlation between acoustic envelopes and visible movement of the articulators.
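The correlational analysis described above relates the area of mouth opening to acoustic envelopes extracted from selected spectral regions. The sketch below illustrates one plausible way to compute such an audio-visual coherence measure; the band edges, frame rate, and random stand-in signals are assumptions for illustration rather than the study's actual processing chain.

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert, resample

def band_envelope(x, fs, lo, hi, frame_rate):
    """Amplitude envelope of a band-passed signal, downsampled to the video frame rate."""
    b, a = butter(4, [lo / (fs / 2), hi / (fs / 2)], btype="band")
    env = np.abs(hilbert(filtfilt(b, a, x)))
    n_frames = int(round(len(x) / fs * frame_rate))
    return resample(env, n_frames)

def av_coherence(audio, fs, lip_area, frame_rate, band):
    """Pearson correlation between a band-limited acoustic envelope and a lip-area trace."""
    env = band_envelope(audio, fs, band[0], band[1], frame_rate)
    n = min(len(env), len(lip_area))
    return np.corrcoef(env[:n], lip_area[:n])[0, 1]

# Hypothetical data: 3 s of audio at 16 kHz and a 30 Hz lip-area track.
fs, frame_rate = 16000, 30
rng = np.random.default_rng(1)
audio = rng.standard_normal(3 * fs)
lip_area = rng.standard_normal(3 * frame_rate)

# Assumed band edges standing in for mid- to high-frequency (second/third formant) energy.
print(av_coherence(audio, fs, lip_area, frame_rate, band=(800, 3000)))
```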
Speech intelligibility for audio-alone and audiovisual (AV) sentences was estimated as a function of signal-to-noise ratio (SNR) for a female target talker presented in a stationary noise, an interfering male talker, or a speech-modulated noise background, for eight hearing-impaired (HI) and five normal-hearing (NH) listeners. At the 50% keywords-correct performance level, HI listeners showed 7-12 dB less fluctuating-masker benefit (FMB) than NH listeners, consistent with previous results. Both groups showed significantly more FMB under AV than audio-alone conditions. When compared at the same stationary-noise SNR, FMB differences between listener groups and modalities were substantially smaller, suggesting that most of the FMB differences at the 50% performance level may reflect a SNR dependence of the FMB. Still, 1-5 dB of the FMB difference between listener groups remained, indicating a possible role for reduced audibility, limited spectral or temporal resolution, or an inability to use auditory source-segregation cues, in directly limiting the ability to listen in the dips of a fluctuating masker. A modified version of the extended speech-intelligibility index that predicts a larger FMB at less favorable SNRs accounted for most of the FMB differences between listener groups and modalities. Overall, these data suggest that HI listeners retain more of an ability to listen in the dips of a fluctuating masker than previously thought. Instead, the fluctuating-masker difficulties exhibited by HI listeners may derive from the reduced FMB associated with the more favorable SNRs they require to identify a reasonable proportion of the target speech.
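Fluctuating-masker benefit (FMB), as used above, is the SNR advantage a listener gains from a fluctuating masker relative to stationary noise at a fixed performance level. A minimal sketch of that computation with hypothetical psychometric data and linear interpolation (the study's actual fitting procedure is not reproduced here):

```python
import numpy as np

def snr_at_criterion(snrs_db, pct_correct, criterion=50.0):
    """SNR (dB) at a given percent-correct level, by linear interpolation
    along a monotonically increasing psychometric function."""
    return float(np.interp(criterion, pct_correct, snrs_db))

# Hypothetical keyword scores (%) at the tested SNRs for one listener.
snrs = np.array([-15, -10, -5, 0, 5])
stationary  = np.array([ 5, 20, 50, 80, 95])   # stationary-noise masker
fluctuating = np.array([15, 45, 75, 90, 98])   # interfering-talker / modulated masker

# FMB: how much lower an SNR supports 50% keywords correct in the
# fluctuating masker than in stationary noise.
fmb = snr_at_criterion(snrs, stationary) - snr_at_criterion(snrs, fluctuating)
print(f"FMB at 50% correct: {fmb:.1f} dB")
```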
Factors leading to variability in auditory-visual (AV) speech recognition include the subject's ability to extract auditory (A) and visual (V) signal-related cues, the integration of A and V cues, and the use of phonological, syntactic, and semantic context. In this study, measures of A, V, and AV recognition of medial consonants in isolated nonsense syllables and of words in sentences were obtained in a group of 29 hearing-impaired subjects. The test materials were presented in a background of speech-shaped noise at 0-dB signal-to-noise ratio. Most subjects achieved substantial AV benefit for both sets of materials relative to A-alone recognition performance. However, there was considerable variability in AV speech recognition both in terms of the overall recognition score achieved and in the amount of audiovisual gain. To account for this variability, consonant confusions were analyzed in terms of phonetic features to determine the degree of redundancy between A and V sources of information. In addition, a measure of integration ability was derived for each subject using recently developed models of AV integration. The results indicated that (1) AV feature reception was determined primarily by visual place cues and auditory voicing + manner cues, (2) the ability to integrate A and V consonant cues varied significantly across subjects, with better integrators achieving more AV benefit, and (3) significant intra-modality correlations were found between consonant measures and sentence measures, with AV consonant scores accounting for approximately 54% of the variability observed for AV sentence recognition. Integration modeling results suggested that speechreading and AV integration training could be useful for some individuals, potentially providing as much as 26% improvement in AV consonant recognition.
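The integration measures referred to above come from dedicated models of AV integration; as a much simpler illustration of the underlying idea, the sketch below compares an observed AV score against an independent-channels (probability-summation) prediction and computes a normalized AV benefit. The scores and the choice of prediction rule are illustrative assumptions, not the models used in the study.

```python
def predicted_av(p_a, p_v):
    """Probability-summation prediction for AV recognition, assuming the
    auditory and visual channels contribute independent information.
    (A deliberately simple stand-in for the integration models cited above.)"""
    return p_a + p_v - p_a * p_v

def av_benefit(p_av, p_a):
    """Relative AV benefit: gain over audio-alone, normalized by the available headroom."""
    return (p_av - p_a) / (1.0 - p_a)

# Hypothetical proportion-correct scores for one hearing-impaired subject.
p_a, p_v, p_av_observed = 0.45, 0.30, 0.72

p_av_expected = predicted_av(p_a, p_v)
print(f"expected AV (independence): {p_av_expected:.2f}")
print(f"observed AV:                {p_av_observed:.2f}")
print(f"relative AV benefit:        {av_benefit(p_av_observed, p_a):.2f}")
```

Subjects whose observed AV scores exceed the independence prediction would, on this simplified view, be the "better integrators" described in the abstract.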
For all but the most profoundly hearing-impaired (HI) individuals, auditory-visual (AV) speech has been shown consistently to afford more accurate recognition than auditory (A) or visual (V) speech. However, the amount of AV benefit achieved (i.e., the superiority of AV performance in relation to unimodal performance) can differ widely across HI individuals. To begin to explain these individual differences, several factors need to be considered. The most obvious of these are deficient A and V speech recognition skills. However, large differences in individuals' AV recognition scores persist even when unimodal skill levels are taken into account. These remaining differences might be attributable to differing efficiency in the operation of a perceptual process that integrates A and V speech information. There is at present no accepted measure of the putative integration process. In this study, several possible integration measures are compared using both congruent and discrepant AV nonsense syllable and sentence recognition tasks. Correlations were tested among the integration measures, and between each integration measure and independent measures of AV benefit for nonsense syllables and sentences in noise. Integration measures derived from tests using nonsense syllables were significantly correlated with each other; on these measures, HI subjects show generally high levels of integration ability. Integration measures derived from sentence recognition tests were also significantly correlated with each other, but were not significantly correlated with the measures derived from nonsense syllable tests. Similarly, the measures of AV benefit based on nonsense syllable recognition tests were found not to be significantly correlated with the benefit measures based on tests involving sentence materials. Finally, there were significant correlations between AV integration and benefit measures derived from the same class of speech materials, but nonsignificant correlations between integration and benefit measures derived from different classes of materials. These results suggest that the perceptual processes underlying AV benefit and the integration of A and V speech information might not operate in the same way on nonsense syllable and sentence input.
Listeners' accuracy in discriminating one temporal pattern from another was measured in three psychophysical experiments. When the standard pattern consisted of equally timed (isochronic) brief tones, whose interonset intervals (IOIs) were 50, 100, or 200 msec, the accuracy in detecting an asynchrony or deviation of one tone in the sequence was about as would be predicted from older research on the discrimination of single time intervals (… at an IOI of 200 msec, 11%-12% at an IOI of 100 msec, and almost 20% at an IOI of 50 msec). In a series of 6 or 10 tones, this accuracy was independent of position of delay for IOIs of 100 and 200 msec. At 50 msec, however, accuracy depended on position, being worst in initial positions and best in final positions. When one tone in a series of six has a frequency different from the others, there is some evidence (at IOI = 200 msec) that interval discrimination is relatively poorer for the tone with the different frequency. Similarly, even if all tones have the same frequency but one interval in the series is made twice as long as the others, temporal discrimination is poorer for the tones bordering the longer interval, although this result is dependent on tempo or IOI. Results with these temporally more complex patterns may be interpreted in part by applying the relative Weber ratio to the intervals before and after the delayed tone. Alternatively, these experiments may show the influence of accent on the temporal discrimination of individual tones.

Temporal aspects of auditory perception, including observations on recognition and discrimination of rhythmic patterns as well as models and theories of timing and rhythmic groups, have recently become the subject of a rich body of literature, particularly in music perception. Different studies have emphasized different aspects: preference for certain phrases, recognition of melodic segments, imitative tapping to infer a "representation" of patterns, and so forth. Our interest here, somewhat less grand, concerns listeners' accuracy in the discrimination of temporal patterns.

The timing of successive elements in an auditory pattern is critical for the identification both of particular sounds and of characteristics of patterns as a whole. The perception of stress in speech or of accent and rhythmic structure in music requires at least that listeners can discriminate different dimensions of individual sounds, including duration, and also different time intervals separating the onsets of successive sounds. The classical literature on temporal discrimination (Woodrow, 1951) suggests that for a reasonably large range of standard time values, discrimination is reliably good when deviations from the standard are of the order of 10%. Woodrow's chapter distinguishes clearly between two types of intervals used in studies on temporal discrimination: (1) empty intervals, or the time intervening between two boundary events, whether sounds or lights, and (2) continuous stimuli, in which judgments are made about the apparent duration of the events...
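The abstract above interprets some of the results by applying the relative Weber ratio to the intervals bordering the delayed tone. A small sketch of that bookkeeping, using a hypothetical pattern in which one interval is twice as long as the others:

```python
def weber_fractions(iois_ms, delayed_tone, delay_ms):
    """Relative (Weber) change in the two intervals bordering a delayed tone.

    Delaying tone k lengthens the interval ending at it (iois_ms[k-1]) and
    shortens the interval starting at it (iois_ms[k]); the smaller the
    relative change, the harder the deviation should be to detect.
    """
    before = iois_ms[delayed_tone - 1]
    after = iois_ms[delayed_tone]
    return delay_ms / before, delay_ms / after

# Hypothetical six-tone pattern whose third interval is twice as long as the
# others; the fourth tone (index 3) is delayed by 10 ms.
iois = [100, 100, 200, 100, 100]
print(weber_fractions(iois, delayed_tone=3, delay_ms=10))   # (0.05, 0.10)
```

The same 10-ms delay produces only a 5% change in the long interval versus a 10% change in the short one, consistent with the report that discrimination is poorer for tones bordering the longer interval.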