Listeners experience speech as a sequence of discrete words. However, the real input is a continuously varying acoustic signal that blends words and phonemes into one another. Here we recorded two-hour magnetoencephalograms from 21 subjects listening to stories, to investigate how the brain concurrently meets three competing demands: 1) processing overlapping acoustic-phonetic information while 2) keeping track of the relative order of phonemic units and 3) maintaining individuated phonetic information until successful word recognition. We show that the human brain transforms speech input, roughly at the rate of phoneme duration, along a temporally defined representational trajectory. These representations, absent from the acoustic signal, are activated earlier when phonemes are predictable than when they are surprising, and are sustained until lexical ambiguity is resolved. The results reveal how phoneme sequences in natural speech are represented and how they interface with stored lexical items.
One sentence summary: The human brain keeps track of the relative order of speech sound sequences by jointly encoding content and elapsed processing time.

Speech comprehension involves mapping non-stationary, highly variable, and continuous acoustic signals onto discrete linguistic representations [1]. Although the human experience is typically one of effortless understanding, the computational infrastructure underpinning speech processing remains a major challenge for neuroscience [2] and artificial intelligence systems [3] alike.
Existing cognitive models primarily serve to explain the recognition of words in isolation [4, 5, 6]. Predictions of these models have gained empirical support in terms of neural encoding of phonetic features [7, 8, 9, 10], and interactions between phonetic and (sub)lexical units of representation [11, 12, 13, 14, 15]. What is not well understood, and what such models largely ignore, however, is how sequences of acoustic-phonetic signals (e.g. the phonemes k-a-t) are mapped to lexical items (e.g. cat) during comprehension of naturalistic continuous speech.
One substantial challenge is that naturalistic language does not come pre-parsed: there are, for example, no reliable cues to word boundaries, and adjacent speech sounds (phonemes) acoustically overlap both within and across words due to co-articulation [1]. In addition, the same sequence of phonemes can form completely different words (e.g. pets versus pest), so preserving phoneme order is critical. Furthermore, phonemes elicit a cascade of neural responses that long outlasts the duration of the phonemes themselves [16, 17, 9]. This means, concretely, that a given phoneme i is still present in both the acoustic and neural signals while subsequent phonemes stimulate the cochlea. Such signal complexity presents serious challenges for the key goals of achieving invariance and perceptual constancy in spoken language comprehension.
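To make the ordering problem concrete, the following toy Python sketch is illustrative only: the words, phoneme symbols, and encoding scheme are our own assumptions, not the paper's method. It shows why an unordered "bag of phonemes" code cannot distinguish anagram words such as pets and pest, whereas jointly tagging each phoneme with its relative position, a crude stand-in for elapsed processing time, keeps the two words distinct.

from collections import Counter

# Two hypothetical words composed of the same four phonemes in different orders.
pets = ["p", "E", "t", "s"]
pest = ["p", "E", "s", "t"]

# 1) Unordered code: record phoneme identities only.
print(Counter(pets) == Counter(pest))  # True: the words are confusable.

# 2) Joint content-and-order code: pair each phoneme with its position,
#    a rough proxy for elapsed processing time since word onset.
joint_pets = {(ph, i) for i, ph in enumerate(pets)}
joint_pest = {(ph, i) for i, ph in enumerate(pest)}
print(joint_pets == joint_pest)  # False: the words remain distinct.

Any representation that discards relative order collapses such pairs; this is the sense in which jointly encoding content and elapsed time, as summarized above, resolves the ambiguity.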
Based on decoding analyses of acoustic and neural data, we show how t...