To capitalize on the rapid development of speech-to-text (STT) technologies and the proliferation of open-source machine learning toolkits, BBN has developed Sage, a new speech processing platform that integrates technologies from multiple sources, each of which has particular strengths. In this paper, we describe the design of Sage, which allows easy interchange of STT components from different sources. We also describe our approach to fast prototyping with new machine learning toolkits, and a framework for sharing STT components across different applications. Finally, we report Sage's state-of-the-art performance on different STT tasks.
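The abstract does not show Sage's API, but the component interchange it describes suggests a plug-in design. The following is a minimal Python sketch of such an interface; all class and function names here are hypothetical illustrations, not Sage's actual code.

```python
# Hypothetical sketch of a plug-in interface that allows STT components from
# different toolkits to be interchanged; names are illustrative, not Sage's API.
from abc import ABC, abstractmethod

class FeatureExtractor(ABC):
    """Uniform contract that any toolkit-specific front end must satisfy."""
    @abstractmethod
    def extract(self, waveform):
        """Return a list of feature frames for a raw waveform."""

class WindowedExtractor(FeatureExtractor):
    # Placeholder front end; a real plug-in would wrap, e.g., a Kaldi or
    # TensorFlow implementation behind the same interface.
    def extract(self, waveform):
        return [waveform[i:i + 400] for i in range(0, len(waveform), 160)]

def transcribe(waveform, frontend: FeatureExtractor):
    # Downstream stages depend only on the interface, so components from
    # different sources can be swapped without changing the rest of the pipeline.
    features = frontend.extract(waveform)
    return features  # decoding would follow here
```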
This paper reports our recent progress on using multilingual data to improve deliverable speech-to-text (STT) systems. We continued BBN's work on using multilingual data to improve Babel evaluation systems, but focused on training time-delay neural network (TDNN) based chain models. As in the Babel evaluations, we used multilingual data in two ways: first, to train multilingual deep neural networks (DNNs) for extracting bottleneck (BN) features, and second, to initialize training on target languages. Our results show that TDNN chain models trained on multilingual DNN bottleneck features yield significant gains over their counterparts trained on MFCC plus i-vector features. By initializing from models trained on multilingual data, TDNN chain models achieve substantial improvements over random initialization of the network weights on target languages. Two other important findings are: 1) initialization with multilingual TDNN chain models produces larger gains on target languages that have less training data; 2) inclusion of target languages in multilingual training, for either BN feature extraction or initialization, has limited impact on performance measured on those target languages. Our results also show that for TDNN chain models, the combination of multilingual BN features and multilingual initialization achieves the best performance on all target languages.
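To make the two uses of multilingual data concrete, here is a hedged PyTorch sketch of (1) extracting bottleneck features from a multilingual DNN and (2) initializing a target-language model from multilingual weights. The layer sizes, bottleneck dimension, and target counts are assumptions for illustration; the paper's actual acoustic models are TDNN chain models rather than the small feed-forward network shown here.

```python
# Illustrative sketch (assumed architecture, not the paper's exact models) of
# the two uses of multilingual data: BN feature extraction and transfer init.
import torch
import torch.nn as nn

class BottleneckDNN(nn.Module):
    """Feed-forward DNN with a low-dimensional bottleneck layer."""
    def __init__(self, feat_dim=40, hidden=1024, bn_dim=80, n_targets=6000):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, bn_dim),               # bottleneck layer
        )
        self.output = nn.Linear(bn_dim, n_targets)   # multilingual targets

    def forward(self, x):
        return self.output(torch.relu(self.encoder(x)))

    def bottleneck(self, x):
        # BN features: activations at the bottleneck layer, used as inputs
        # to the downstream acoustic model.
        return self.encoder(x)

# (1) BN feature extraction with a (pretrained) multilingual DNN.
multilingual = BottleneckDNN()
frames = torch.randn(16, 40)                 # batch of acoustic feature frames
bn_feats = multilingual.bottleneck(frames)   # shape: (16, 80)

# (2) Transfer initialization: copy the shared layers into a target-language
# model and re-initialize only the language-specific output layer.
target = BottleneckDNN(n_targets=3500)       # target language has its own targets
target.encoder.load_state_dict(multilingual.encoder.state_dict())
```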
On small datasets, discriminatively trained bottleneck features from deep networks commonly outperform more traditional spectral or cepstral features. While these features are typically trained with small, fully connected networks, recent studies have used more sophisticated networks with great success. We use the recent deep CNN (VGG) network for bottleneck feature extraction, previously applied only to low-resource tasks, and apply it to the Switchboard English conversational telephone speech task. Unlike features derived from traditional MLP networks, the VGG features outperform cepstral features even when used with BLSTM acoustic models trained on large amounts of data. We achieve the best BBN single-system performance when combining the VGG features with a BLSTM acoustic model. When decoding with an n-gram language model, as is typical for deployable systems, we obtain a realistic production system with a WER of 7.4%. This result is competitive with the current state of the art in the literature. While our focus is on realistic single-system performance, we further reduce the WER to 6.1% through system combination and computationally expensive neural network language model rescoring.
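As a concrete illustration of the feature extractor described above, the following PyTorch sketch shows a VGG-style convolutional network producing bottleneck features from log-mel spectrogram input. The channel counts, context window, and bottleneck dimension are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of a VGG-style convolutional bottleneck feature extractor; all
# dimensions below are assumptions for illustration.
import torch
import torch.nn as nn

def vgg_block(in_ch, out_ch):
    # Two 3x3 convolutions followed by 2x2 max pooling, the basic VGG pattern.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(),
        nn.MaxPool2d(2),
    )

class VGGBottleneck(nn.Module):
    def __init__(self, bn_dim=80):
        super().__init__()
        self.convs = nn.Sequential(vgg_block(1, 64), vgg_block(64, 128))
        # With 40 mel bins and a 16-frame context window, two 2x2 poolings
        # leave a 128 x 4 x 10 feature map (assumed dimensions).
        self.bottleneck = nn.Linear(128 * 4 * 10, bn_dim)

    def forward(self, x):                 # x: (batch, 1, 16 frames, 40 bins)
        h = self.convs(x).flatten(1)
        return self.bottleneck(h)         # BN features fed to the acoustic model

feats = torch.randn(8, 1, 16, 40)
bn = VGGBottleneck()(feats)               # shape: (8, 80)
```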