2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU)
DOI: 10.1109/asru.2015.7404810

Multi-task joint-learning of deep neural networks for robust speech recognition

Cited by 22 publications (17 citation statements). References 18 publications.
“…The work in Tang et al [Tang, Li and Wang (2016)] integrated speaker recognition and speech recognition into a multi-task learning framework using a recursive structure, attempting to use a unified model to perform the two tasks simultaneously. The work in Qian et al [Qian, Yin, You et al (2015)] combined two different DNNs (one for feature denoising and one for acoustic modeling) into a complete multi-task framework, in which all parameters are trained from scratch in a genuine multi-task fashion with two criteria. The work in [Thanda and Venkatesan (2017)] combined the speaker's lip visual information with the audio input for speech recognition, learning the mapping from an audio-visual fusion feature to frame labels obtained from a GMM/HMM acoustic model; the secondary task maps visual features to frame labels derived from another GMM/HMM model.…”
Section: Related Work
confidence: 99%
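The joint denoising-plus-acoustic-modeling idea described in the excerpt above can be illustrated with a small sketch. The following is a hypothetical PyTorch example, not the cited authors' actual architecture: the layer sizes, 40-dimensional features, 3000 senone targets, and the weighting factor alpha are all illustrative assumptions. It shows a front-end DNN regressing clean features (MSE criterion) feeding a back-end DNN classifying senones (cross-entropy criterion), with the two losses combined so that all parameters train jointly from scratch.

import torch
import torch.nn as nn

class JointDenoiseAM(nn.Module):
    """Front-end denoising DNN plus back-end acoustic-model DNN, trained jointly (illustrative)."""
    def __init__(self, feat_dim=40, hidden=1024, num_senones=3000):
        super().__init__()
        # Front-end: maps noisy features to an estimate of the clean features.
        self.denoiser = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, feat_dim),
        )
        # Back-end: senone classifier over the enhanced features.
        self.acoustic = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_senones),
        )

    def forward(self, noisy_feats):
        enhanced = self.denoiser(noisy_feats)
        senone_logits = self.acoustic(enhanced)
        return enhanced, senone_logits

def joint_loss(enhanced, senone_logits, clean_feats, senone_labels, alpha=0.5):
    # Two criteria combined into one objective so both DNNs train together from scratch.
    mse = nn.functional.mse_loss(enhanced, clean_feats)
    ce = nn.functional.cross_entropy(senone_logits, senone_labels)
    return alpha * mse + (1.0 - alpha) * ce

# Toy usage with random tensors standing in for parallel noisy/clean data.
model = JointDenoiseAM()
noisy = torch.randn(8, 40)
clean = torch.randn(8, 40)
labels = torch.randint(0, 3000, (8,))
enhanced, logits = model(noisy)
loss = joint_loss(enhanced, logits, clean, labels)
loss.backward()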
“…This ensures that the noise types encountered in the test utterances were unseen during the training phase. We created two multi-conditioned training sets by adding 100 types of environmental noise [13] and 11 types of noise from the Noisex noise database [14] at SNRs of 0-15 dB in 5 dB increments, following [15]. We removed the babble noise from Noisex since it is present in the test set.…”
Section: Database Description
confidence: 99%
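The multi-condition set construction mentioned above amounts to mixing noise into clean utterances at chosen SNRs. Below is a minimal NumPy sketch of such mixing at 0-15 dB in 5 dB steps; the function name mix_at_snr and the placeholder signals are assumptions for illustration, not code from the cited work.

import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Add noise to a clean signal so the mixture has the requested SNR in dB."""
    # Tile or truncate the noise to cover the utterance.
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[:len(clean)]
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale the noise so that 10*log10(clean_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(clean_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise

# Example: mix one placeholder utterance at each SNR used for the training sets.
clean_utt = np.random.randn(16000)   # stand-in for a real clean waveform
noise_seg = np.random.randn(8000)    # stand-in for a real noise recording
noisy_versions = {snr: mix_at_snr(clean_utt, noise_seg, snr) for snr in (0, 5, 10, 15)}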
“…While the model with the direct enhancement block (without T-F masking) gives 11.20% WER. It has been shown in [15] that simply adding more layers to the DNN baseline does not improve ASR performance significantly. The results from the model with the direct enhancement block seem to confirm this.…”
Section: Network Architecture and Training
confidence: 99%
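The contrast drawn above between a direct enhancement block and a T-F masking block can be sketched as follows. This is a hedged PyTorch illustration assuming magnitude-spectrogram inputs; the 257-bin dimension and layer sizes are assumptions rather than the cited configuration. The direct variant regresses clean magnitudes outright, while the masking variant predicts a bounded [0, 1] mask that is multiplied element-wise with the noisy spectrogram, so it can only attenuate energy rather than invent it.

import torch
import torch.nn as nn

feat_dim = 257  # e.g. magnitude bins of a 512-point STFT (assumed)

# "Direct" enhancement: regress the clean magnitude spectrogram from the noisy one.
direct_enhancer = nn.Sequential(
    nn.Linear(feat_dim, 1024), nn.ReLU(),
    nn.Linear(1024, feat_dim), nn.ReLU(),   # ReLU keeps the output magnitudes non-negative
)

# T-F masking: predict a [0, 1] mask and apply it multiplicatively to the noisy input.
mask_estimator = nn.Sequential(
    nn.Linear(feat_dim, 1024), nn.ReLU(),
    nn.Linear(1024, feat_dim), nn.Sigmoid(),
)

noisy_mag = torch.rand(8, feat_dim)          # placeholder noisy magnitude frames
enhanced_direct = direct_enhancer(noisy_mag)
enhanced_masked = mask_estimator(noisy_mag) * noisy_mag  # mask attenuates the noisy spectrogram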
“…Therefore, many studies have been carried out on introducing techniques developed for one of them into the other. These include speaker adaptation, speaker adaptive training, and the universal background model for SRE [1] [2].…”
Section: Introduction
confidence: 99%