The natural ecology of human language is face-to-face interaction, comprising cues, such as co-speech gestures, mouth movements and prosody, tightly synchronized with speech. Yet this rich multimodal context is usually stripped away in experimental studies, as the dominant paradigm focuses on speech alone. We ask how these audiovisual cues impact brain activity during naturalistic language comprehension, how they are dynamically orchestrated, and whether they are organized hierarchically. We quantified each cue in video clips of a speaker and used a well-established electroencephalographic marker of comprehension difficulty: an event-related potential peaking around 400 ms after word onset. We found that multimodal cues always modulated brain activity in interaction with speech, that their impact changed dynamically with their informativeness, and that there is a hierarchy: prosody shows the strongest effect, followed by gestures and mouth movements. Thus, this study provides a first snapshot of how the brain dynamically weights audiovisual cues in real-world language comprehension.

Electrophysiology of multimodal comprehension

frame theories of natural language processing, because if some multimodal cues (e.g., gesture or prosody) always contribute to processing, this would imply that our current speech-only focus is too narrow, if not misleading. Second, we need to understand the dynamics of online multimodal comprehension. In particular, to provide mechanistic accounts of language comprehension, it is necessary to establish how the weight of a given cue changes dynamically with context (e.g., whether meaningful hand gestures are weighted more when the prior linguistic context is less informative and/or when mouth movements are less informative).
Finally, it is important to establish whether there is a stable hierarchical organization of cues (e.g., prior linguistic context may always be weighted more than gestures, which are in turn weighted more than mouth movements).
Prosody, gesture and mouth movements as predictors of upcoming words: the state of the art

Accentuation (i.e., prosodic stress, characterized by higher pitch, that makes words acoustically prominent) marks new information 10. Many behavioural studies have shown that comprehension is facilitated by appropriate accentuation (new information accentuated, and old information de-accentuated) 11,12. Incongruence between the presence of prosodic accentuation and the newness of information increases processing difficulty, inducing increased activation in the left inferior frontal gyrus, interpreted as increased phonological and semantic processing difficulty 13. In electrophysiological (EEG) studies, such a mismatch elicits a more negative N400 (an event-related potential (ERP) peaking negatively around 400 ms after word presentation over centro-parietal areas 14, which has been argued to index prediction in language comprehension 2) than appropriate accentuation does 15-20.
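The N400 described above is obtained by averaging EEG segments time-locked to word onsets. As a minimal illustration of that averaging step (a hypothetical sketch, not the authors' analysis pipeline; channel selection, filtering, and artifact rejection are all omitted), epoch extraction with baseline correction can be written as:

```python
import numpy as np

def epoch_average(eeg, onsets, sfreq, tmin=-0.2, tmax=0.8):
    """Average EEG epochs time-locked to word onsets to obtain an ERP.

    eeg    : 1-D array, a single channel (e.g., a centro-parietal electrode)
    onsets : sample indices of word onsets
    sfreq  : sampling rate in Hz
    tmin/tmax : epoch window in seconds relative to word onset
    """
    start = int(tmin * sfreq)   # samples before onset (negative)
    stop = int(tmax * sfreq)    # samples after onset
    epochs = []
    for o in onsets:
        seg = eeg[o + start : o + stop]
        # subtract the mean of the pre-onset baseline interval
        seg = seg - seg[: -start].mean()
        epochs.append(seg)
    # the ERP is the pointwise average over epochs; the N400 would appear
    # as a negative deflection ~400 ms (0.4 * sfreq samples) after onset
    return np.mean(epochs, axis=0)
```

A condition contrast (e.g., congruent vs. incongruent accentuation) would then compare such averages in the 300-500 ms window, where a more negative deflection indicates greater processing difficulty.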