Abstract: To capitalize on the rapid development of Speech-to-Text (STT) technologies and the proliferation of open source machine learning toolkits, BBN has developed Sage, a new speech processing platform that integrates technologies from multiple sources, each of which has particular strengths. In this paper, we describe the design of Sage, which allows the easy interchange of STT components from different sources. We also describe our approach for fast prototyping with new machine learning toolkits, and a framework …
“…We use the Sage ASR toolkit [16] for all experiments. Sage is BBN's newly developed STT platform that integrates technologies from multiple sources, each of which has a particular strength.…”
This paper proposes several improvements to multilingual training of neural network acoustic models for speech recognition and keyword spotting in the context of low-resource languages. We concentrate on the stacked architecture, where the first network is used as a bottleneck feature extractor and the second network as the acoustic model. We propose to improve multilingual training when the amount of data from different languages is very different by applying balancing scalars to the training examples. We also explore how to exploit multilingual data to train the second neural network of the stacked architecture. An ensemble training method that can take advantage of both unsupervised pretraining and multilingual training is found to give the best speech recognition performance across a wide variety of languages, while system combination of differently trained multilingual models results in further improvements in keyword search performance.
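As a rough illustration of the balancing idea in this abstract (the inverse-hours weighting scheme, the function name, and the example data amounts below are assumptions for illustration, not the paper's recipe), a per-language loss scalar can be computed so that examples from low-resource languages count more during training:

    # Hypothetical sketch of per-language balancing scalars for multilingual
    # training. Assumes a simple inverse-frequency scheme normalized so that
    # the average weight across languages is 1.0.
    def balancing_scalars(hours_per_language):
        """Map language -> loss weight; low-resource languages count more."""
        inv = {lang: 1.0 / h for lang, h in hours_per_language.items()}
        mean_inv = sum(inv.values()) / len(inv)
        return {lang: w / mean_inv for lang, w in inv.items()}

    # Example with three languages of very different sizes (hours are made up).
    weights = balancing_scalars({"tagalog": 80.0, "zulu": 60.0, "amharic": 10.0})
    for lang, w in sorted(weights.items()):
        print(f"{lang}: {w:.2f}")  # per-example loss multiplier during training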
“…We use the Sage ASR toolkit [23]. Sage is BBN's newly developed STT platform that integrates technologies from multiple sources, each of which has a particular strength.…”
Low resourced languages suffer from limited training data and resources. Data augmentation is a common approach to increasing the amount of training data. Additional data is synthesized by manipulating the original data with a variety of methods. Unlike most previous work that focuses on a single technique, we combine multiple, complementary augmentation approaches. The first stage adds noise and perturbs the speed of additional copies of the original audio. The data is further augmented in a second stage, where a novel fMLLR-based augmentation is applied to bottleneck features to further improve performance. A reduction in word error rate is demonstrated on four languages from the IARPA Babel program. We present an analysis exploring why these techniques are beneficial. Index Terms: speech recognition, deep neural networks, data augmentation
Speed Perturbation
Ko et al. [18] showed success by manipulating the speed of the data, demonstrating a performance improvement over the more common vocal tract length perturbation (VTLP) technique [8]. Using the SoX utility [19], the original data is perturbed by a warping factor that affects both the frequencies and the duration of the speech. The speed change is accomplished by resampling the waveform, which not only changes the duration but also scales the pitch, vocal tract length, and all spectral frequencies by the same factor. Our setup uses a randomly selected warping factor between 0.9 and 1.1 (this was also the …
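A minimal sketch of this step follows; the file names, the uniform sampling of the warp factor, and the helper name are illustrative assumptions, while the SoX "speed" effect and the 0.9-1.1 range come from the excerpt above:

    # Sketch of SoX-based speed perturbation: the "speed" effect resamples the
    # waveform, scaling duration, pitch, and all spectral frequencies together.
    import random
    import subprocess

    def perturb_speed(in_wav, out_wav, lo=0.9, hi=1.1, seed=None):
        rng = random.Random(seed)
        factor = rng.uniform(lo, hi)  # randomly selected warp factor in [0.9, 1.1]
        subprocess.run(["sox", in_wav, out_wav, "speed", f"{factor:.3f}"],
                       check=True)
        return factor

    # Usage: create one perturbed copy of an utterance.
    # perturb_speed("utt001.wav", "utt001_sp.wav")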
“…The ASR models in this paper are trained using BBN's speech recognition system, Sage [20], which makes use of the Kaldi speech recognition toolkit [21]. All of the models reported are hybrid TDNN-LSTMs, which are trained with alternating time-delay neural network (TDNN) layers and long short-term memory (LSTM) layers, as in [22].…”
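For readers unfamiliar with this architecture, here is a minimal sketch of an alternating TDNN-LSTM stack, with TDNN layers realized as dilated 1-D convolutions over time. The use of PyTorch, the layer sizes, dilations, and output dimension are all assumptions for illustration, not the configuration of [22]:

    # Hedged sketch of a hybrid TDNN-LSTM acoustic model with alternating layers.
    import torch
    import torch.nn as nn

    class TDNNLSTM(nn.Module):
        def __init__(self, feat_dim=40, hidden=512, num_targets=3000):
            super().__init__()
            # TDNN layers as dilated 1-D convolutions over the time axis.
            self.tdnn1 = nn.Conv1d(feat_dim, hidden, kernel_size=3, dilation=1)
            self.lstm1 = nn.LSTM(hidden, hidden, batch_first=True)
            self.tdnn2 = nn.Conv1d(hidden, hidden, kernel_size=3, dilation=3)
            self.lstm2 = nn.LSTM(hidden, hidden, batch_first=True)
            self.out = nn.Linear(hidden, num_targets)  # senone posteriors

        def forward(self, x):  # x: (batch, time, feat_dim)
            x = torch.relu(self.tdnn1(x.transpose(1, 2))).transpose(1, 2)
            x, _ = self.lstm1(x)
            x = torch.relu(self.tdnn2(x.transpose(1, 2))).transpose(1, 2)
            x, _ = self.lstm2(x)
            return self.out(x)

    # model = TDNNLSTM(); logits = model(torch.randn(4, 200, 40))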
Automatic speech recognition (ASR) systems are highly sensitive to train-test domain mismatch. However, because transcription is often prohibitively expensive, it is important to be able to make use of available transcribed out-of-domain data. We address the problem of domain adaptation with semi-supervised training (SST). Contrary to work on in-domain SST, we find significant performance improvement even with just one hour of target-domain data, though the selection of the data is critical. We show that minimum phone error rate is a good oracle measure for selection, and we approximate this measure by using the average phone confidence of an utterance. With larger domain shifts, we also find that deletions and low lexical diversity are a serious issue, which we address by incorporating phone rate into our selection metric. With our proposed selection criterion, we see up to 57% relative improvement over the out-of-domain baseline model. Furthermore, this selection method generalizes well, and matches or outperforms word-level confidence selection across six separate domain shift conditions.
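A hypothetical sketch of the kind of utterance-selection score this abstract describes follows; the way phone rate is combined with confidence, the rejection threshold, and all names are assumptions, since the paper's exact metric is not reproduced here:

    # Hypothetical selection score for semi-supervised training: average
    # per-phone confidence, with a phone-rate gate to reject deletion-heavy
    # hypotheses. Threshold and combination rule are illustrative only.
    def selection_score(phone_confidences, duration_sec, min_phone_rate=8.0):
        """Score one utterance for target-domain SST selection."""
        avg_conf = sum(phone_confidences) / len(phone_confidences)
        phone_rate = len(phone_confidences) / duration_sec  # phones per second
        if phone_rate < min_phone_rate:  # too few phones: likely deletions
            return 0.0
        return avg_conf

    # Keep the highest-scoring utterances, e.g.:
    # selected = sorted(utts, key=lambda u: selection_score(*u), reverse=True)[:k]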