Abstract: To capitalize on the rapid development of Speech-to-Text (STT) technologies and the proliferation of open source machine learning toolkits, BBN has developed Sage, a new speech processing platform that integrates technologies from multiple sources, each of which has particular strengths. In this paper, we describe the design of Sage, which allows the easy interchange of STT components from different sources. We also describe our approach for fast prototyping with new machine learning toolkits, and a framework …
“…We use the Sage ASR toolkit [16] for all experiments. Sage is BBN's newly developed STT platform that integrates technologies from multiple sources, each of which has a particular strength.…”
This paper proposes several improvements to multilingual training of neural network acoustic models for speech recognition and keyword spotting in the context of low-resource languages. We concentrate on the stacked architecture, where the first network is used as a bottleneck feature extractor and the second network as the acoustic model. We propose to improve multilingual training when the amount of data from different languages is very different by applying balancing scalars to the training examples. We also explore how to exploit multilingual data to train the second neural network of the stacked architecture. An ensemble training method that can take advantage of both unsupervised pretraining and multilingual training is found to give the best speech recognition performance across a wide variety of languages, while system combination of differently trained multilingual models results in further improvements in keyword search performance.
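As a rough illustration of the balancing idea in this abstract (the inverse-hours weighting scheme, the function name, and the example data amounts below are assumptions for illustration, not the paper's recipe), a per-language loss scalar can be computed so that examples from low-resource languages count more during training:

    # Hypothetical sketch of per-language balancing scalars for multilingual
    # training. Assumes a simple inverse-frequency scheme normalized so that
    # the average weight across languages is 1.0.
    def balancing_scalars(hours_per_language):
        """Map language -> loss weight; low-resource languages count more."""
        inv = {lang: 1.0 / h for lang, h in hours_per_language.items()}
        mean_inv = sum(inv.values()) / len(inv)
        return {lang: w / mean_inv for lang, w in inv.items()}

    # Example with three languages of very different sizes (hours are made up).
    weights = balancing_scalars({"tagalog": 80.0, "zulu": 60.0, "amharic": 10.0})
    for lang, w in sorted(weights.items()):
        print(f"{lang}: {w:.2f}")  # per-example loss multiplier during training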
“…We use the Sage ASR toolkit [23]. Sage is BBN's newly developed STT platform that integrates technologies from multiple sources, each of which has a particular strength.…”
Low resourced languages suffer from limited training data and resources. Data augmentation is a common approach to increasing the amount of training data. Additional data is synthesized by manipulating the original data with a variety of methods. Unlike most previous work that focuses on a single technique, we combine multiple, complementary augmentation approaches. The first stage adds noise and perturbs the speed of additional copies of the original audio. The data is further augmented in a second stage, where a novel fMLLR-based augmentation is applied to bottleneck features to further improve performance. A reduction in word error rate is demonstrated on four languages from the IARPA Babel program. We present an analysis exploring why these techniques are beneficial. Index Terms: speech recognition, deep neural networks, data augmentation
Speed Perturbation
Ko et al. [18] showed success by manipulating the speed of the data, demonstrating a performance improvement over the more common vocal tract length perturbation (VTLP) technique [8]. Using the SoX utility [19], the original data is perturbed by a warping factor that affects both the frequencies and the duration of the speech. The speed change is accomplished by resampling the waveform, which not only changes the duration but also scales the pitch, vocal tract length, and all spectral frequencies by the same factor. Our setup uses a randomly selected warping factor between 0.9 and 1.1 (this was also the …
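A minimal sketch of this step follows; the file names, the uniform sampling of the warp factor, and the helper name are illustrative assumptions, while the SoX "speed" effect and the 0.9-1.1 range come from the excerpt above:

    # Sketch of SoX-based speed perturbation: the "speed" effect resamples the
    # waveform, scaling duration, pitch, and all spectral frequencies together.
    import random
    import subprocess

    def perturb_speed(in_wav, out_wav, lo=0.9, hi=1.1, seed=None):
        rng = random.Random(seed)
        factor = rng.uniform(lo, hi)  # randomly selected warp factor in [0.9, 1.1]
        subprocess.run(["sox", in_wav, out_wav, "speed", f"{factor:.3f}"],
                       check=True)
        return factor

    # Usage: create one perturbed copy of an utterance.
    # perturb_speed("utt001.wav", "utt001_sp.wav")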
“…The ASR models in this paper are trained using BBN's speech recognition system, Sage [20], which makes use of the Kaldi speech recognition toolkit [21]. All of the models reported are hybrid TDNN-LSTMs, which are trained with alternating time-delay neural network (TDNN) layers and long short-term memory (LSTM) layers, as in [22].…”
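For readers unfamiliar with this architecture, here is a minimal sketch of an alternating TDNN-LSTM stack, with TDNN layers realized as dilated 1-D convolutions over time. The use of PyTorch, the layer sizes, dilations, and output dimension are all assumptions for illustration, not the configuration of [22]:

    # Hedged sketch of a hybrid TDNN-LSTM acoustic model with alternating layers.
    import torch
    import torch.nn as nn

    class TDNNLSTM(nn.Module):
        def __init__(self, feat_dim=40, hidden=512, num_targets=3000):
            super().__init__()
            # TDNN layers as dilated 1-D convolutions over the time axis.
            self.tdnn1 = nn.Conv1d(feat_dim, hidden, kernel_size=3, dilation=1)
            self.lstm1 = nn.LSTM(hidden, hidden, batch_first=True)
            self.tdnn2 = nn.Conv1d(hidden, hidden, kernel_size=3, dilation=3)
            self.lstm2 = nn.LSTM(hidden, hidden, batch_first=True)
            self.out = nn.Linear(hidden, num_targets)  # senone posteriors

        def forward(self, x):  # x: (batch, time, feat_dim)
            x = torch.relu(self.tdnn1(x.transpose(1, 2))).transpose(1, 2)
            x, _ = self.lstm1(x)
            x = torch.relu(self.tdnn2(x.transpose(1, 2))).transpose(1, 2)
            x, _ = self.lstm2(x)
            return self.out(x)

    # model = TDNNLSTM(); logits = model(torch.randn(4, 200, 40))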
Automatic speech recognition (ASR) systems are highly sensitive to train-test domain mismatch. However, because transcription is often prohibitively expensive, it is important to be able to make use of available transcribed out-of-domain data. We address the problem of domain adaptation with semi-supervised training (SST). Contrary to work on in-domain SST, we find significant performance improvement even with just one hour of target-domain data, though the selection of the data is critical. We show that minimum phone error rate is a good oracle measure for selection, and we approximate this measure by using the average phone confidence of an utterance. With larger domain shifts, we also find that deletions and low lexical diversity are a serious issue, which we address by incorporating phone rate into our selection metric. With our proposed selection criterion, we see up to 57% relative improvement over the out-of-domain baseline model. Furthermore, this selection method generalizes well, and matches or outperforms word-level confidence selection across six separate domain shift conditions.
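A hypothetical sketch of the kind of utterance-selection score this abstract describes follows; the way phone rate is combined with confidence, the rejection threshold, and all names are assumptions, since the paper's exact metric is not reproduced here:

    # Hypothetical selection score for semi-supervised training: average
    # per-phone confidence, with a phone-rate gate to reject deletion-heavy
    # hypotheses. Threshold and combination rule are illustrative only.
    def selection_score(phone_confidences, duration_sec, min_phone_rate=8.0):
        """Score one utterance for target-domain SST selection."""
        avg_conf = sum(phone_confidences) / len(phone_confidences)
        phone_rate = len(phone_confidences) / duration_sec  # phones per second
        if phone_rate < min_phone_rate:  # too few phones: likely deletions
            return 0.0
        return avg_conf

    # Keep the highest-scoring utterances, e.g.:
    # selected = sorted(utts, key=lambda u: selection_score(*u), reverse=True)[:k]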