Sequence-to-sequence models have shown success in end-to-end speech recognition. However, these models have only used shallow acoustic encoder networks. In our work, we successively train very deep convolutional networks to add more expressive power and better generalization to end-to-end ASR models. We apply network-in-network principles, batch normalization, residual connections, and convolutional LSTMs to build very deep recurrent and convolutional structures. Our models exploit the spectral structure in the feature space and add computational depth without overfitting issues. We experiment on the WSJ ASR task and achieve a 10.5% word error rate without any dictionary or language model, using a 15-layer deep network.
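The residual connections mentioned in this abstract can be illustrated with a minimal sketch. The block below is a generic NumPy toy (a same-padded 1-D cross-correlation with an identity shortcut), not the paper's actual 15-layer architecture; the function names and shapes are illustrative assumptions.

```python
import numpy as np

def conv1d(x, w):
    """Same-padded 1-D cross-correlation along time.

    x: (in_channels, T) feature map; w: (out_channels, in_channels, k) kernel.
    (A "conv" layer in deep-learning frameworks is typically cross-correlation.)
    """
    out_ch, in_ch, k = w.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad)))
    T = x.shape[1]
    y = np.zeros((out_ch, T))
    for t in range(T):
        # Dot every kernel with the local (in_channels x k) patch.
        y[:, t] = np.tensordot(w, xp[:, t:t + k], axes=([1, 2], [0, 1]))
    return y

def residual_block(x, w1, w2):
    """y = x + Conv(ReLU(Conv(x))).

    The identity shortcut lets very deep stacks train: if the convolutions
    learn nothing (zero weights), the block reduces to the identity.
    """
    h = np.maximum(conv1d(x, w1), 0.0)
    return x + conv1d(h, w2)
```

With zero weights the block passes its input through unchanged, which is the property that makes stacking many such blocks benign.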
This paper introduces a new speech corpus called "LibriTTS" designed for text-to-speech use. It is derived from the original audio and text materials of the LibriSpeech corpus, which has been used for training and evaluating automatic speech recognition systems. The new corpus inherits desired properties of the LibriSpeech corpus while addressing a number of issues that make LibriSpeech less than ideal for text-to-speech work. The released corpus consists of 585 hours of speech data at a 24 kHz sampling rate from 2,456 speakers, together with the corresponding texts. Experimental results show that neural end-to-end TTS models trained on the LibriTTS corpus achieved mean opinion scores above 4.0 for naturalness for five out of six evaluation speakers. The corpus is freely available for download from http://www.openslr.org/60/.
Common spatial pattern (CSP)-based spatial filtering is the most popular approach to electroencephalogram (EEG) feature extraction for motor imagery (MI) classification in brain-computer interface (BCI) applications. The effectiveness of CSP is highly affected by the frequency band and time window of the EEG segments. Although numerous algorithms have been designed to optimize the spectral bands of CSP, most of them select the time window heuristically. This is likely to result in suboptimal feature extraction, since the time period during which the brain's response to the mental task occurs may not be accurately detected. In this paper, we propose a novel algorithm, namely temporally constrained sparse group spatial pattern (TSGSP), for the simultaneous optimization of filter bands and time windows within CSP to further boost the classification accuracy of MI EEG. Specifically, spectrum-specific signals are first derived by bandpass filtering the raw EEG data at a set of overlapping filter bands. Each of the spectrum-specific signals is further segmented into multiple subseries using a sliding window approach. We then devise a joint sparse optimization of filter bands and time windows with a temporal smoothness constraint to extract robust CSP features under a multitask learning framework. A linear support vector machine classifier is trained on the optimized EEG features to accurately identify the MI tasks. An experimental study on three public EEG datasets (BCI Competition III dataset IIIa, BCI Competition IV dataset IIa, and BCI Competition IV dataset IIb) validates the effectiveness of TSGSP in comparison with several competing methods. The superior classification performance (averaged accuracies of 88.5%, 83.3%, and 84.3% for the three datasets, respectively) confirms that the proposed algorithm is a promising candidate for improving the performance of MI-based BCIs.
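The CSP spatial filtering that TSGSP builds on can be sketched in a few lines. This is the standard two-class CSP solved via whitening and eigendecomposition, not the paper's sparse group optimization; function names and the log-variance feature helper are illustrative assumptions.

```python
import numpy as np

def csp_filters(C1, C2):
    """Standard two-class CSP from per-class spatial covariance matrices.

    Solves the generalized eigenproblem C1 w = lambda (C1 + C2) w by whitening
    the composite covariance, then diagonalizing the whitened class-1
    covariance. Returns W (n_channels x n_channels) whose rows are spatial
    filters ordered by decreasing class-1 variance ratio.
    """
    Cc = C1 + C2
    d, U = np.linalg.eigh(Cc)               # eigh: ascending eigenvalues
    P = np.diag(1.0 / np.sqrt(d)) @ U.T     # whitening transform for Cc
    S1 = P @ C1 @ P.T
    lam, V = np.linalg.eigh(S1)
    order = np.argsort(lam)[::-1]           # most class-1-discriminative first
    return V[:, order].T @ P

def log_var_features(W, X, k=1):
    """Log-normalized variance of the k first and k last filtered components,
    the usual CSP feature vector fed to a linear classifier."""
    Wk = np.vstack([W[:k], W[-k:]])
    v = np.var(Wk @ X, axis=1)
    return np.log(v / v.sum())
```

By construction, W jointly whitens the two classes: W(C1+C2)Wᵀ = I, and W C1 Wᵀ and W C2 Wᵀ are diagonal with entries summing to one, so the first and last filters maximize the variance ratio between the classes.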
Antidepressants are widely prescribed, but their efficacy relative to placebo is modest, in part because the clinical diagnosis of major depression encompasses biologically heterogeneous conditions. Here, we sought to identify a neurobiological signature of response to antidepressant treatment as compared to placebo. We designed a latent-space machine learning algorithm tailored for resting-state electroencephalography (rsEEG) and applied it to data from the largest imaging-coupled, placebo-controlled antidepressant study (n=309). Symptom improvement was robustly predicted in a manner both specific for the antidepressant sertraline (versus placebo) and generalizable across different study sites and EEG equipment. This sertraline-predictive EEG signature generalized to two depression samples, wherein it reflected general antidepressant medication responsivity, and related differentially to repetitive transcranial magnetic stimulation (rTMS) treatment outcome. Furthermore, we found that the sertraline rsEEG signature indexed prefrontal neural responsivity, as measured by concurrent TMS/EEG. Our findings advance the neurobiological understanding of antidepressant treatment through an EEG-tailored computational model and provide a clinical avenue for personalized treatment of depression.
We employ a combination of recent developments in semi-supervised learning for automatic speech recognition to obtain state-of-the-art results on LibriSpeech, utilizing the unlabeled audio of the Libri-Light dataset. More precisely, we carry out noisy student training with SpecAugment using giant Conformer models pretrained with wav2vec 2.0. By doing so, we achieve word error rates (WERs) of 1.4%/2.6% on the LibriSpeech test/test-other sets, against the current state-of-the-art WERs of 1.7%/3.3%.
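The SpecAugment masking used in the noisy student training above can be sketched as a minimal NumPy routine. This version applies only frequency and time masking (the original recipe also includes time warping); the parameter names F and T echo the SpecAugment paper's mask-size limits, but the function itself is a hypothetical helper, not this paper's implementation.

```python
import numpy as np

def spec_augment(spec, rng, F=10, T=20, n_freq_masks=1, n_time_masks=1):
    """Mask random frequency bands and time spans of a spectrogram.

    spec: (freq_bins x frames) log-mel spectrogram; masked regions are zeroed.
    F / T bound the width of each frequency / time mask. Returns a new array.
    """
    spec = spec.copy()
    n_freq, n_time = spec.shape
    for _ in range(n_freq_masks):
        f = rng.integers(0, F + 1)                    # mask width in [0, F]
        f0 = rng.integers(0, max(1, n_freq - f + 1))  # mask start
        spec[f0:f0 + f, :] = 0.0
    for _ in range(n_time_masks):
        t = rng.integers(0, T + 1)
        t0 = rng.integers(0, max(1, n_time - t + 1))
        spec[:, t0:t0 + t] = 0.0
    return spec
```

In noisy student training, such masking is applied to the student's input so it must match the teacher's pseudo-labels from partially hidden evidence.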