Three experiments are reported on the influence of different timing relations on the McGurk effect. In the first experiment, it is shown that strict temporal synchrony between auditory and visual speech stimuli is not required for the McGurk effect: subjects were strongly influenced by the visual stimuli even when the auditory stimuli lagged the visual stimuli by as much as 180 msec. In addition, a stronger McGurk effect was found when the visual and auditory vowels matched. In the second experiment, we paired auditory and visual speech stimuli produced under different speaking conditions (fast, normal, clear). The results showed that the manipulations of the visual and auditory speaking conditions independently influenced perception. In addition, there was a small but reliable tendency for better-matched stimuli to elicit more McGurk responses than unmatched stimuli. In the third experiment, we combined auditory and visual stimuli produced under different speaking conditions (fast, clear) and delayed the acoustics with respect to the visual stimuli. The subjects showed the same pattern of results as in the second experiment, and the delay did not produce different patterns of results for the different audiovisual speaking-style combinations. The results suggest that perceivers may be sensitive to the concordance of the time-varying aspects of speech but do not require temporal coincidence of that information.

When the face moves during speech production, it provides information about the place of articulation as well as the class of phoneme being produced. Evidence from studies of lipreading, as well as studies of speech in noise (e.g., Sumby & Pollack, 1954), suggests that perceivers can gain significant amounts of information about the speech target through the visual channel. How this information is combined with the speech acoustics to form a single percept, however, is not clear. One useful approach to studying audiovisual integration in speech is to dub various auditory stimuli onto different visual speech stimuli. When a discrepancy exists between the information from the two modalities, subjects fuse the visual and auditory information to form a new percept. For example, when the face articulates /gi/ and the auditory stimulus is /bi/, many subjects report hearing /di/.
Auditory feedback during speech production is known to play a role in speech sound acquisition and is also important for the maintenance of accurate articulation. In two studies, the first formant (F1) of monosyllabic consonant-vowel-consonant (CVC) words was shifted electronically and fed back to the participant with minimal delay, so that participants perceived the modified speech as their own production. When feedback was shifted up (experiments 1 and 2) or down (experiment 1), participants compensated by shifting their produced F1 in the opposite frequency direction from baseline. The threshold size of manipulation that initiated a compensation in F1 was usually greater than 60 Hz. When normal feedback was restored, F1 did not return immediately to baseline but showed an exponential deadaptation pattern. Experiment 1 showed that this pattern was not influenced by the direction of the F1 shift: raising and lowering F1 produced the same effects. Experiment 2 showed that the number of trials for which F1 was held at the maximum shift (0, 15, or 45 trials) did not influence the recovery from adaptation. There was a correlation between the lag-one autocorrelation of trial-to-trial changes in F1 in the baseline recordings and the magnitude of compensation; some participants therefore appeared to stabilize their productions more actively from trial to trial. The results provide insight into the perceptual control of speech and the representations that govern sensorimotor coordination.
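The lag-one autocorrelation statistic mentioned in the last result can be computed directly from a participant's baseline F1 series. The Python sketch below is illustrative only: the function name and sample values are invented, and the paper's exact estimator may differ. A negative value indicates that a deviation on one trial tends to be followed by a change in the opposite direction on the next, i.e., active trial-to-trial stabilization.

```python
import numpy as np

def lag_one_autocorrelation(f1_hz):
    """Lag-1 autocorrelation of the trial-to-trial changes in F1."""
    f1 = np.asarray(f1_hz, dtype=float)
    diffs = np.diff(f1)              # change in F1 from each trial to the next
    d = diffs - diffs.mean()
    return float(np.sum(d[:-1] * d[1:]) / np.sum(d * d))

# Hypothetical baseline F1 values (Hz) across trials for one participant:
baseline_f1 = [612, 598, 605, 590, 601, 595, 608, 600]
print(lag_one_autocorrelation(baseline_f1))
```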
Hearing one's own speech is important for language learning and for the maintenance of accurate articulation. For example, people with postlinguistically acquired deafness often show a gradual deterioration of many aspects of speech production. Here, data are presented that address the role played by acoustic feedback in the control of voice fundamental frequency (F0). Eighteen subjects produced vowels under a control condition (normal F0 feedback) and two experimental conditions: F0 shifted up and F0 shifted down. In each experimental condition, subjects produced vowels during a training period in which their F0 feedback was slowly shifted without their awareness; following this exposure to transformed F0, their acoustic feedback was returned to normal. Two effects were observed: subjects compensated for the change in F0, and they showed negative aftereffects, modifying their produced F0 in the direction opposite to the shift once feedback was returned to normal. The results suggest that fundamental frequency is controlled using auditory feedback and with reference to an internal pitch representation. This is consistent with current work on internal models of speech motor control.
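As a rough illustration of the adaptation paradigm described above, the sketch below generates a per-trial feedback-shift schedule in which the F0 perturbation ramps up slowly before being held at its maximum. All names, trial counts, and the 100-cent maximum are assumptions made for illustration; the study's actual schedule is not given in the abstract.

```python
import numpy as np

def f0_shift_schedule(n_baseline=20, n_ramp=60, n_hold=20,
                      max_shift_cents=100, direction=+1):
    """Per-trial F0 feedback shift (in cents) for one adaptation run.

    Baseline trials get unaltered feedback; the shift then ramps up in
    steps small enough to go unnoticed and is held at its maximum.
    """
    baseline = np.zeros(n_baseline)
    ramp = np.linspace(0.0, max_shift_cents, n_ramp)
    hold = np.full(n_hold, float(max_shift_cents))
    return direction * np.concatenate([baseline, ramp, hold])

shifts = f0_shift_schedule(direction=-1)  # the "F0 shifted down" condition
factors = 2.0 ** (shifts / 1200.0)        # multiplicative factor applied to F0
```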
Perceivers' eye movements were recorded during audiovisual presentations of extended monologues. Monologues were presented at different image sizes and with different levels of acoustic masking noise. Two clear targets of gaze fixation were identified: the eyes and the mouth. Regardless of image size, both Japanese and English perceivers gazed more at the mouth as masking-noise levels increased. However, even at the highest noise levels and largest image sizes, subjects gazed at the mouth only about half the time. For the eye target, perceivers typically gazed at one eye more than the other, a tendency that became stronger at higher noise levels. English perceivers displayed a greater variety of gaze-sequence patterns (e.g., left eye to mouth to left eye to right eye) and persisted in using them at higher noise levels than did Japanese perceivers. No segment-level correlations were found between perceivers' eye motions and the phoneme identity of the stimuli.
Auditory feedback influences human speech production, as demonstrated by studies using rapid pitch and loudness changes. Feedback has also been investigated using the gradual manipulation of formants in adaptation studies with whispered speech. In the work reported here, the first formant of steady-state isolated vowels was unexpectedly altered within trials for voiced speech, using a real-time formant-tracking and filtering system developed for this purpose. The first formant of the vowel /ɛ/ was manipulated 100% toward either /æ/ or /ɪ/, and participants responded by altering their production, with average F1 compensation as large as 16.3% and 10.6% of the applied formant shift, respectively. Compensation was estimated to begin less than 460 ms after stimulus onset. The rapid formant compensations found here suggest that auditory feedback control is similar for both F0 and formants.
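The compensation percentages reported above are, in essence, the produced change in F1 expressed as a fraction of the applied shift, with the sign flipped because compensation opposes the perturbation. A minimal sketch of that calculation follows; the function name and the data (the +200 Hz shift and the F1 values) are hypothetical, not from the study.

```python
import numpy as np

def compensation_percent(baseline_f1_hz, shifted_f1_hz, applied_shift_hz):
    """Produced F1 change as a percentage of the applied feedback shift.

    Compensation opposes the perturbation, so the sign is flipped to
    report opposing changes as positive percentages.
    """
    produced_change = np.mean(shifted_f1_hz) - np.mean(baseline_f1_hz)
    return float(-100.0 * produced_change / applied_shift_hz)

# Hypothetical data: feedback F1 raised by +200 Hz (toward /æ/), and the
# speaker lowers produced F1 from ~620 Hz to ~588 Hz on average.
print(compensation_percent([618, 622, 620], [590, 586, 588], 200.0))  # ~16.0
```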
This fMRI study explores the brain regions involved in the perceptual enhancement afforded by observing visual speech gesture information. Subjects passively identified words presented in five conditions: audio-only, audiovisual, audio-only with noise, audiovisual with noise, and visual-only. The brain may use concordant auditory and visual information to enhance perception by integrating the two streams at a converging multisensory site. Consistent with the response properties of multisensory integration sites, enhanced activity in the middle and superior temporal gyrus/sulcus was greatest when concordant audiovisual stimuli were presented with acoustic noise. Activity in brain regions involved in the planning and execution of speech production, observed in response to visual speech presented with degraded or absent auditory stimulation, is consistent with an additional pathway through which speech perception is facilitated by internally simulating the intended speech act of the observed speaker.
Social interaction involves the active visual perception of facial expressions and communicative gestures. This study examines the distribution of gaze fixations while watching videos of expressive talking faces. The knowledge-driven factors that influence the selective visual processing of facial information were examined by presenting the same set of stimuli and assigning subjects to either a speech-recognition task or an emotion-judgment task. For half of the subjects in each task, the intelligibility of the speech was reduced by the addition of moderate masking noise. Both the task and the intelligibility of the speech signal influenced the spatial distribution of gaze: gaze was concentrated more on the eyes when emotion was being judged than when words were being identified, and when noise was added to the acoustic signal, gaze in both tasks was more centralized on the face. This shows that subjects' gaze is sensitive to the distribution of information on the face but can also be influenced by strategies aimed at maximizing the amount of visual information processed.
A computerized pulsed-ultrasound system was used to monitor tongue dorsum movements during the production of consonant-vowel sequences in which speech rate, vowel, and consonant were varied. The kinematics of tongue movement were analyzed by measuring the lowering gesture of the tongue to give estimates of movement amplitude, duration, and maximum velocity. All three subjects in the study showed reliable correlations between the amplitude of the tongue dorsum movement and its maximum velocity. Further, the ratio of the maximum velocity to the extent of the gesture, a kinematic indicator of articulator stiffness, was found to vary inversely with the duration of the movement. This relationship held both within individual conditions and across all conditions in the study such that a single function was able to accommodate a large proportion of the variance due to changes in movement duration. As similar findings have been obtained both for abduction and adduction gestures of the vocal folds and for rapid voluntary limb movements, the data suggest that a wide range of changes in the duration of individual movements might all have a similar origin. The control of movement rate and duration through the specification of biomechanical characteristics of speech articulators is discussed.
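The kinematic measures in this study can be sketched for a single lowering gesture. Everything in the Python sketch below is illustrative: a real analysis would segment gestures from continuous ultrasound records (e.g., at velocity zero-crossings) rather than assume one gesture per trace, and the sample trace is invented.

```python
import numpy as np

def gesture_kinematics(position, fs):
    """Amplitude, duration, peak velocity, and the velocity/amplitude
    ratio (a kinematic index of stiffness) for one movement trace.

    `position` is a 1-D record of tongue-dorsum height for a single
    gesture; `fs` is the sampling rate in Hz.
    """
    x = np.asarray(position, dtype=float)
    v = np.gradient(x) * fs                  # velocity in units per second
    amplitude = x.max() - x.min()            # extent of the gesture
    duration = (len(x) - 1) / fs             # movement time in seconds
    peak_velocity = float(np.abs(v).max())
    stiffness = peak_velocity / amplitude    # predicted to vary with 1/duration
    return amplitude, duration, peak_velocity, stiffness

# Hypothetical 100-Hz trace of a smooth lowering gesture (arbitrary units):
t = np.linspace(0.0, 0.25, 26)
trace = 1.0 - 0.25 * (1.0 - np.cos(np.pi * t / 0.25))
print(gesture_kinematics(trace, fs=100))
```

Under this definition, producing the same-amplitude gesture in less time raises the peak velocity relative to the amplitude, which is the inverse relation between the stiffness index and movement duration that the study reports.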