Acoustic features of speech include various spectral and temporal cues. The temporal envelope is known to play a critical role in speech recognition by human listeners, whereas automated speech recognition (ASR) relies heavily on spectral analysis. This study compared sentence-recognition scores of humans and an ASR software package, Dragon, when spectral and temporal-envelope cues were manipulated in background noise. Temporal fine structure of meaningful sentences was reduced by noise or tone vocoders. Three types of background noise were introduced: white noise, time-reversed multi-talker noise, and fake-formant noise. Spectral information was manipulated by changing the number of frequency channels. For human listeners, with a 20-dB signal-to-noise ratio (SNR) and four vocoding channels, white noise had a stronger disruptive effect than the fake-formant noise. The same observation was made with 22 channels when the SNR was lowered to 0 dB. In contrast, ASR was unable to function with four vocoding channels even at a 20-dB SNR. Its performance was least affected by white noise and most affected by the fake-formant noise. Increasing the number of channels, which improved the spectral resolution, produced nonmonotonic behavior for the ASR with white noise but not with colored noise. The ASR also showed markedly improved performance with tone vocoders. It is possible that fake-formant noise affected the software's performance by disrupting spectral cues, whereas white noise affected performance by compromising speech segmentation. Overall, these results suggest that human listeners and ASR use different listening strategies in noise.
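The SNR conditions mentioned above (20 dB and 0 dB) can be illustrated with a minimal sketch of mixing a signal with noise at a target SNR. This is not the study's procedure: the function names `rms` and `mix_at_snr`, the 16-kHz sampling rate, and the sine-wave stand-in for speech are all assumptions made for illustration, and SNR is taken here as the ratio of RMS levels in dB.

```python
import math
import random

def rms(x):
    """Root-mean-square level of a sequence of samples."""
    return math.sqrt(sum(v * v for v in x) / len(x))

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that 20*log10(rms(speech)/rms(noise)) equals snr_db,
    then add it to `speech`. Returns (mixture, scaled_noise)."""
    gain = rms(speech) / (rms(noise) * 10 ** (snr_db / 20))
    scaled = [gain * n for n in noise]
    mixture = [s + n for s, n in zip(speech, scaled)]
    return mixture, scaled

# Illustrative inputs (assumptions, not the study's stimuli).
fs = 16000                                   # hypothetical sampling rate in Hz
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 220 * n / fs) for n in range(fs)]  # stand-in for a sentence
noise = [rng.gauss(0.0, 1.0) for _ in range(fs)]                    # white Gaussian noise

mixed, scaled = mix_at_snr(speech, noise, 20.0)
achieved_db = 20 * math.log10(rms(speech) / rms(scaled))
```

Lowering `snr_db` from 20 to 0 raises the noise gain by a factor of 10, which puts the noise at the same RMS level as the speech.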
Decoding spatial attention from brain signals has wide applications in brain–computer interfaces (BCIs). Previous BCI systems mostly relied on visual patterns or auditory stimulation (e.g., loudspeakers) to evoke synchronous brain signals. Such stimulation protocols make it difficult to cover a large range of spatial locations. The present study explored the possibility of using virtual acoustic space and a visual-auditory matching paradigm to overcome this issue. The technique has the flexibility to generate sound stimulation from virtually any spatial location. Brain signals of eight human subjects were recorded with a 32-channel electroencephalogram (EEG) system. Two amplitude-modulated noises or speech sentences carrying distinct spatial information were presented concurrently. Each sound source was tagged with a unique modulation phase so that the phase of the recorded EEG signals indicated the sound being attended to. The phase-tagged sound was further filtered with head-related transfer functions to create the sense of virtual space. Subjects were required to attend to the sound source that best matched the location of a visual target. For all subjects, the phase of a single sound was accurately reflected at the majority of electrodes in EEG responses of 90 s or less. Fewer electrodes provided significant decoding of auditory attention, and these may require longer EEG responses. The reliability and efficiency of decoding with a single electrode varied across subjects. Overall, the virtual acoustic space protocol has the potential to be used in practical BCI systems.
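The phase-tagging idea can be sketched as follows. This is a toy simulation, not the authors' analysis pipeline: the 40-Hz modulation frequency, the 250-Hz EEG sampling rate, the two tag phases (0 and π), the response amplitude, and the helper names `phase_at`, `decode_attended`, and `simulate_eeg` are all assumptions. The simulated response simply follows the attended source's modulation with no neural delay; in real recordings a latency-induced phase lag would have to be calibrated first.

```python
import cmath
import math
import random

def phase_at(signal, freq_hz, fs):
    """Phase (radians) of the freq_hz component, from a single-bin DFT projection."""
    acc = sum(v * cmath.exp(-2j * math.pi * freq_hz * n / fs)
              for n, v in enumerate(signal))
    return cmath.phase(acc)

def circ_dist(a, b):
    """Absolute angular distance between two phases, wrapped to [0, pi]."""
    return abs(cmath.phase(cmath.exp(1j * (a - b))))

def decode_attended(eeg, tag_phases, fm, fs):
    """Return the label whose tag phase is closest to the EEG phase at fm."""
    est = phase_at(eeg, fm, fs)
    return min(tag_phases, key=lambda label: circ_dist(est, tag_phases[label]))

def simulate_eeg(attended, tag_phases, fm, fs, seconds, rng):
    """Toy single-electrode response: a weak sinusoid at fm carrying the
    attended source's tag phase, buried in Gaussian noise (assumed model)."""
    phi = tag_phases[attended]
    return [0.5 * math.cos(2 * math.pi * fm * n / fs + phi) + rng.gauss(0.0, 1.0)
            for n in range(int(seconds * fs))]

# Hypothetical parameters chosen for illustration only.
fs, fm = 250, 40.0
tags = {"left": 0.0, "right": math.pi}
rng = random.Random(1)

eeg = simulate_eeg("right", tags, fm, fs, seconds=10, rng=rng)
decoded = decode_attended(eeg, tags, fm, fs)
```

Longer recordings average out more noise in the projection, which is consistent with the observation above that some electrodes may need longer EEG responses to decode reliably.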