A decade after the first successful attempt to decode speech directly from human brain signals, accuracy and speed remain far below that of natural speech or typing. Here we show how to achieve high accuracy from the electrocorticogram at natural-speech rates, even with few data (on the order of half an hour of spoken speech). Taking a cue from recent advances in machine translation and automatic speech recognition, we train a recurrent neural network to map neural signals directly to word sequences (sentences). In particular, the network first encodes a sentence-length sequence of neural activity into an abstract representation, and then decodes this representation, word by word, into an English sentence. For each participant, training data consist of several spoken repeats of a set of some 30-50 sentences, along with the corresponding neural signals at each of about 250 electrodes distributed over peri-Sylvian speech cortices. Average word error rates across a validation (held-out) sentence set are as low as 7% for some participants, as compared to the previous state of the art of greater than 60%. Finally, we show how to use transfer learning to overcome limitations on data availability: Training certain components of the network under multiple participants' data, while keeping other components (e.g., the first hidden layer) "proprietary," can improve decoding performance-despite very different electrode coverage across participants.
Natural communication often occurs in dialogue, differentially engaging auditory and sensorimotor brain regions during listening and speaking. However, previous attempts to decode speech directly from the human brain typically consider listening or speaking tasks in isolation. Here, human participants listened to questions and responded aloud with answers while we used high-density electrocorticography (ECoG) recordings to detect when they heard or said an utterance and to then decode the utterance’s identity. Because certain answers were only plausible responses to certain questions, we could dynamically update the prior probabilities of each answer using the decoded question likelihoods as context. We decode produced and perceived utterances with accuracy rates as high as 61% and 76%, respectively (chance is 7% and 20%). Contextual integration of decoded question likelihoods significantly improves answer decoding. These results demonstrate real-time decoding of speech in an interactive, conversational setting, which has important implications for patients who are unable to communicate.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.