Interspeech 2013
DOI: 10.21437/interspeech.2013-48

Improved feature processing for deep neural networks

Cited by 113 publications (19 citation statements)
References 11 publications
“…We choose the Kaldi Toolkit [29] as the ASR back-end system to evaluate the DNN-HMM hybrid system on the 8-channel REVERB Challenge task [30] (WSJ0 trigram 5k language model, circular microphone array with a microphone spacing of 8 cm). As a first step, a GMM-HMM system is trained on the clean WSJCAM0 Cambridge Read News REVERB corpus [31] with feature extraction following the Type-I creation in [32], which is state-of-the-art in the Kaldi recipe [29]. Then, we create a state-frame alignment to train the DNN on the multi-condition training sets (each of 7861 utterances) provided by the REVERB challenge [30].…”
Section: Methods
confidence: 99%
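
The training flow in this excerpt (GMM-HMM on clean data, then a state-frame alignment to supervise DNN training on the multi-condition sets) mirrors the standard Kaldi recipe structure. A minimal orchestration sketch using stock Kaldi recipe scripts is shown below; the data/ and exp/ directory names, job counts, and Gaussian budgets are hypothetical placeholders rather than the cited system's configuration, and the script assumes it runs from a Kaldi egs directory with the usual path setup.

    # Sketch of the clean-train -> align -> DNN-train flow using standard
    # Kaldi recipe scripts. All directory names are hypothetical placeholders.
    import subprocess

    def run(cmd):
        """Run a Kaldi recipe script, failing loudly on error."""
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    # 1) GMM-HMM trained on the clean corpus (monophone, then triphone)
    run(["steps/train_mono.sh", "--nj", "8",
         "data/train_clean", "data/lang", "exp/mono"])
    run(["steps/align_si.sh", "--nj", "8",
         "data/train_clean", "data/lang", "exp/mono", "exp/mono_ali"])
    run(["steps/train_deltas.sh", "2500", "15000",
         "data/train_clean", "data/lang", "exp/mono_ali", "exp/tri1"])

    # 2) State-frame alignment of the multi-condition training set
    run(["steps/align_si.sh", "--nj", "8",
         "data/train_multi", "data/lang", "exp/tri1", "exp/tri1_ali_multi"])

    # 3) DNN-HMM hybrid trained on those alignments (nnet2 recipe as one option)
    run(["steps/nnet2/train_tanh.sh",
         "data/train_multi", "data/lang", "exp/tri1_ali_multi", "exp/dnn_multi"])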
“…After monophone and triphone training, Mel Frequency Cepstral Coefficients (MFCCs) are processed with Linear Discriminant Analysis (LDA) and a Maximum Likelihood Linear Transform (MLLT). This is followed by Speaker Adaptive Training (SAT) with feature-space MLLR (fMLLR) [27,28]. This HMM-GMM system is denoted Baseline in Table 2.…”
Section: Acoustic Model Training and Evaluation
confidence: 99%
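
The LDA+MLLT step described here corresponds to Kaldi's steps/train_lda_mllt.sh, which estimates LDA on spliced MFCC frames with tied HMM states as classes and then estimates an MLLT on top. Below is a minimal numpy sketch of the LDA estimation alone; the splice width (+/-3 frames), the dimensions, and all variable names are illustrative assumptions rather than the cited systems' settings, and the MLLT stage is omitted.

    # Minimal LDA over spliced MFCC frames (illustrative; Kaldi's
    # steps/train_lda_mllt.sh does this internally, followed by MLLT).
    import numpy as np
    from scipy.linalg import eigh

    def splice(feats, context=3):
        """Stack each frame with +/-context neighbours (edges repeated)."""
        T, d = feats.shape
        padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
        return np.hstack([padded[i:i + T] for i in range(2 * context + 1)])

    def estimate_lda(feats, labels, out_dim=40):
        """LDA: maximise between-class over within-class scatter."""
        mean = feats.mean(axis=0)
        d = feats.shape[1]
        Sw = np.zeros((d, d))  # within-class scatter
        Sb = np.zeros((d, d))  # between-class scatter
        for c in np.unique(labels):
            fc = feats[labels == c]
            mc = fc.mean(axis=0)
            Sw += (fc - mc).T @ (fc - mc)
            Sb += len(fc) * np.outer(mc - mean, mc - mean)
        # Generalised eigenproblem Sb v = lambda Sw v; keep top directions.
        vals, vecs = eigh(Sb, Sw + 1e-6 * np.eye(d))
        return vecs[:, np.argsort(vals)[::-1][:out_dim]].T  # (out_dim, d)

    # Example: 13-dim MFCCs, spliced +/-3 frames -> 91 dims, reduced to 40.
    mfcc = np.random.randn(1000, 13)              # stand-in for real MFCCs
    states = np.random.randint(0, 50, size=1000)  # stand-in tied-state labels
    A = estimate_lda(splice(mfcc), states)
    lda_feats = splice(mfcc) @ A.T                # shape (1000, 40)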
“…After monophone and triphone training, input features are processed with Linear Discriminant Analysis (LDA) and a Maximum Likelihood Linear Transform (MLLT). This is followed by Speaker Adaptive Training (SAT) with feature-space MLLR (fMLLR [27]). In the speaker-dependent scenario, each recording session is treated as a separate speaker for SAT.…”
Section: Systems and Results
confidence: 99%
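
In Kaldi terms, "treating each recording session as a separate speaker" for SAT reduces to writing the session identifier as the speaker field of the utt2spk map, so that fMLLR transforms are estimated per session. A tiny sketch, assuming a hypothetical <session>_<utterance> ID format:

    # Sketch: build a Kaldi utt2spk file where the "speaker" is the
    # recording session, so SAT/fMLLR adapts per session.
    # Assumes hypothetical utterance IDs of the form <session>_<uttnum>.
    utt_ids = ["sessA_001", "sessA_002", "sessB_001"]

    with open("data/train/utt2spk", "w") as f:
        for utt in sorted(utt_ids):
            session = utt.rsplit("_", 1)[0]  # session ID acts as speaker ID
            f.write(f"{utt} {session}\n")
    # utils/utt2spk_to_spk2utt.pl then produces the matching spk2utt file.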