Visual capture and the ventriloquism aftereffect resolve spatial disparities of incongruent auditory-visual (AV) objects by shifting auditory spatial perception to align with vision. Here, we demonstrated the distinct temporal characteristics of visual capture and the ventriloquism aftereffect in response to brief AV disparities. In a set of experiments, subjects localized either the auditory component of AV targets (A within AV) or a second sound presented at varying delays (1-20s) after AV exposure (A2 after AV). AV targets were trains of brief presentations (1 or 20), covering a ±30° azimuthal range, and with ±8° (R or L) disparity. We found that the magnitude of visual capture generally reached its peak within a single AV pair and did not dissipate with time, while the ventriloquism aftereffect accumulated with repetitions of AV pairs and dissipated with time. Additionally, the magnitude of the auditory shift induced by each phenomenon was uncorrelated across listeners and visual capture was unaffected by subsequent auditory targets, indicating that visual capture and the ventriloquism aftereffect are separate mechanisms with distinct effects on auditory spatial perception. Our results indicate that visual capture is a ‘sample-and-hold’ process that binds related objects and stores the combined percept in memory, whereas the ventriloquism aftereffect is a ‘leaky integrator’ process that accumulates with experience and decays with time to compensate for cross-modal disparities.
The ventriloquism aftereffect (VAE) refers to a shift in auditory spatial perception following exposure to a spatial disparity between auditory and visual stimuli. The VAE has been previously measured on two distinct time scales. Hundreds or thousands of exposures to a an audio-visual spatial disparity produces enduring VAE that persists after exposure ceases. Exposure to a single audio-visual spatial disparity produces immediate VAE that decays over seconds. To determine if these phenomena are two extremes of a continuum or represent distinct processes, we conducted an experiment with normal hearing listeners that measured VAE in response to a repeated, constant audio-visual disparity sequence, both immediately after exposure to each audio-visual disparity and after the end of the sequence. In each experimental session, subjects were exposed to sequences of auditory and visual targets that were constantly offset by +8° or −8° in azimuth from one another, then localized auditory targets presented in isolation following each sequence. Eye position was controlled throughout the experiment, to avoid the effects of gaze on auditory localization. In contrast to other studies that did not control eye position, we found both a large shift in auditory perception that decayed rapidly after each AV disparity exposure, along with a gradual shift in auditory perception that grew over time and persisted after exposure to the AV disparity ceased. We modeled the temporal and spatial properties of the measured auditory shifts using grey box nonlinear system identification, and found that two models could explain the data equally well. In the power model, the temporal decay of the ventriloquism aftereffect was modeled with a power law relationship. This causes an initial rapid drop in auditory shift, followed by a long tail which accumulates with repeated exposure to audio-visual disparity. In the double exponential model, two separate processes were required to explain the data, one which accumulated and decayed exponentially and the other which slowly integrated over time. Both models fit the data best when the spatial spread of the ventriloquism aftereffect was limited to a window around the location of the audio-visual disparity. We directly compare the predictions made by each model, and suggest additional measurements that could help distinguish which model best describes the mechanisms underlying the VAE.
Band importance functions estimate the relative contribution of individual acoustic frequency bands to speech intelligibility. Previous studies of band importance in listeners with cochlear implants have used experimental maps and direct stimulation. Here, band importance was estimated for clinical maps with acoustic stimulation. Listeners with cochlear implants had band importance functions that relied more heavily on lower frequencies and showed less cross-listener consistency than in listeners with normal hearing. The intersubject variability observed here indicates that averaging band importance functions across listeners with cochlear implants, as has been done in previous studies, may not be meaningful. Additionally, band importance functions of listeners with normal hearing for vocoded speech that either did or did not simulate spread of excitation were not different from one another, suggesting that additional factors beyond spread of excitation are necessary to account for changes in band importance in listeners with cochlear implants.
Purpose The goal of this study was to determine how various aspects of cognition predict speech recognition ability across different levels of speech vocoding within a single group of listeners. Method We tested the ability of young adults ( N = 32) with normal hearing to recognize Perceptually Robust English Sentence Test Open-set (PRESTO) sentences that were degraded with a vocoder to produce different levels of spectral resolution (16, eight, and four carrier channels). Participants also completed tests of cognition (fluid intelligence, short-term memory, and attention), which were used as predictors of sentence recognition. Sentence recognition was compared across vocoder conditions, predictors were correlated with individual differences in sentence recognition, and the relationships between predictors were characterized. Results PRESTO sentence recognition performance declined with a decreasing number of vocoder channels, with no evident floor or ceiling performance in any condition. Individual ability to recognize PRESTO sentences was consistent relative to the group across vocoder conditions. Short-term memory, as measured with serial recall, was a moderate predictor of sentence recognition (ρ = 0.65). Serial recall performance was constant across vocoder conditions when measured with a digit span task. Fluid intelligence was marginally correlated with serial recall, but not sentence recognition. Attentional measures had no discernible relationship to sentence recognition and a marginal relationship with serial recall. Conclusions Verbal serial recall is a substantial predictor of vocoded sentence recognition, and this predictive relationship is independent of spectral resolution. In populations that show variable speech recognition outcomes, such as listeners with cochlear implants, it should be possible to account for the independent effects of spectral resolution and verbal serial recall in their speech recognition ability. Supplemental Material https://doi.org/10.23641/asha.12021051
Vision typically has better spatial accuracy and precision than audition, and as a result often captures auditory spatial perception when visual and auditory cues are presented together. One determinant of visual capture is the amount of spatial disparity between auditory and visual cues: when disparity is small visual capture is likely to occur, and when disparity is large visual capture is unlikely. Previous experiments have used two methods to probe how visual capture varies with spatial disparity. First, congruence judgment assesses perceived unity between cues by having subjects report whether or not auditory and visual targets came from the same location. Second, auditory localization assesses the graded influence of vision on auditory spatial perception by having subjects point to the remembered location of an auditory target presented with a visual target. Previous research has shown that when both tasks are performed concurrently they produce similar measures of visual capture, but this may not hold when tasks are performed independently. Here, subjects alternated between tasks independently across three sessions. A Bayesian inference model of visual capture was used to estimate perceptual parameters for each session, which were compared across tasks. Results demonstrated that the range of audio-visual disparities over which visual capture was likely to occur were narrower in auditory localization than in congruence judgment, which the model indicates was caused by subjects adjusting their prior expectation that targets originated from the same location in a task-dependent manner.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.