2015
DOI: 10.1016/j.cortex.2015.04.008

Prediction across sensory modalities: A neurocomputational model of the McGurk effect

Keywords: Audiovisual integration; Computational modeling; McGurk effect

Abstract: The McGurk effect is a textbook illustration of the automaticity with which the human brain integrates audio-visual speech. It shows that even incongruent audiovisual (AV) speech stimuli can be combined into percepts that correspond neither to the auditory nor to the visual input, but to a mix of both. Typically, when presented with, e.g., visual /aga/ and acoustic /aba/, we perceive an illusory /ada/. In the inverse situation, however, …
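To illustrate how an intermediate percept such as /ada/ can emerge from incongruent cues, below is a minimal sketch of precision-weighted cue fusion. The one-dimensional feature axis, the assumed category positions, and the noise variances are all illustrative assumptions; the paper itself uses a predictive-coding model, not this simplified scheme.

```python
import numpy as np

# Minimal sketch of precision-weighted audiovisual fusion.
# One illustrative "place of articulation" axis with assumed
# category positions; NOT the paper's predictive-coding model.
CATEGORIES = {"/aba/": 0.0, "/ada/": 0.5, "/aga/": 1.0}  # assumed layout

def fuse(x_a, var_a, x_v, var_v):
    """Combine auditory and visual cues, each weighted by its precision."""
    w_a, w_v = 1.0 / var_a, 1.0 / var_v
    return (w_a * x_a + w_v * x_v) / (w_a + w_v)

def categorize(x):
    """Pick the category whose assumed position is closest to x."""
    return min(CATEGORIES, key=lambda k: abs(CATEGORIES[k] - x))

# Incongruent McGurk stimulus: acoustic /aba/ plus visual /aga/.
fused = fuse(x_a=CATEGORIES["/aba/"], var_a=0.1,
             x_v=CATEGORIES["/aga/"], var_v=0.1)
print(categorize(fused))  # -> /ada/: the illusory intermediate percept
```

With equally reliable cues the fused estimate lands midway between /aba/ and /aga/, closest to /ada/; making one cue noisier pulls the percept toward the other modality.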

Cited by 39 publications (54 citation statements)
References 52 publications
“…While the mechanisms leading to AV speech fusion are relatively well understood, those leading to AV stimulus combination are still unknown. Based on a previous computational model, we conjectured that AV combination follows from the difficulty of mapping the auditory and visual physical features into a multisensory space presumably located in the LSTS 28. AV combination would hence result in a more demanding processing sequence than AV fusion (Figure 1B).…”
Section: Discussion
confidence: 97%
“…However, according to our predictive model of AV syllable integration 28 … an interesting illustration of how predictive coding could apply to AV integration 28,57,58.…”
Section: The Role of the LSTS in the Fusion/Combination Dynamic Divergence
confidence: 95%
“…As described above, it proposes that syllables are encoded in terms of the expected amplitudes and variances of audiovisual features. The expected amplitudes were taken from the mean values across 10 productions from a single male speaker (Olasagasti, Bouton, and Giraud 2015); the amplitudes were then normalized by dividing by the highest value for each feature. The remaining parameters in the model, the variances and sensory noise levels, were chosen so that the overall categorization results, the percentages of /aba/, /ada/ and /aga/ responses to the 6 types of stimuli, were qualitatively similar to those reported by Lüttke and colleagues.…”
Section: Model Simulations
confidence: 99%
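A minimal sketch of the normalization step quoted above, under one plausible reading: each feature (column) is divided by its highest expected amplitude across syllables. The matrix shape and values are hypothetical, not the authors' measurements.

```python
import numpy as np

# Hypothetical expected amplitudes (values made up): one row per
# syllable, one column per audiovisual feature (e.g., lip aperture,
# 2nd formant). Each entry stands for a mean over 10 productions.
expected = np.array([[0.8, 1.2],   # /aba/
                     [0.5, 0.9],   # /ada/
                     [0.3, 1.5]])  # /aga/

# Assumed reading of "dividing by the highest value for each
# feature": normalize each column by its maximum, so the largest
# expected amplitude per feature becomes 1.
normalized = expected / expected.max(axis=0)
print(normalized)
```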
“…The lip and 2nd formant temporal modulation profiles (M_V(t) and M_A(t), Fig. 1B) were defined as in (Olasagasti, Bouton, and Giraud 2015). Each profile, representing the intervocalic transition between the two vowels in the “a” vocalic context, was modelled with 289 time points.…”
Section: Model Simulations
confidence: 99%
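As a rough illustration of such a profile, the sketch below builds a smooth intervocalic transition over 289 time points. The sigmoid shape, onset timing, and amplitude range are assumptions for illustration only; the actual M_V(t) and M_A(t) in Olasagasti, Bouton, and Giraud (2015) were derived from recorded speech.

```python
import numpy as np

N_POINTS = 289  # number of time points per profile, as quoted above

def modulation_profile(onset, slope, low=0.0, high=1.0, n=N_POINTS):
    """Assumed sigmoid-shaped transition between two amplitude levels.

    The true M_V(t) and M_A(t) come from measured speech; this is
    only an illustrative stand-in with a similar overall shape.
    """
    t = np.linspace(0.0, 1.0, n)
    return low + (high - low) / (1.0 + np.exp(-slope * (t - onset)))

# Visible lip movements typically lead the acoustic transition, so
# the assumed visual onset here is slightly earlier than the
# auditory one.
M_V = modulation_profile(onset=0.40, slope=20.0)  # lip profile
M_A = modulation_profile(onset=0.50, slope=20.0)  # 2nd formant profile
print(M_V.shape, M_A.shape)  # (289,) (289,)
```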