Learning feature mapping using deep neural network bottleneck features for distant large vocabulary speech recognition

Himawan, Ivan; Motlíček, Petr; Potard, Blaise; Kim, Nam Hoon; Lee, Jaewon

doi:10.1109/icassp.2015.7178830

Cited by 31 publications

(21 citation statements)

References 19 publications


“…The results for systems without applying fMLLR have been previously reported in [22]. Compared to the baseline performance, BN-based system improves the performance on SDM while trained on IHM data by 12.6% absolute WER (from 76.0% to 63.4%; 16.5% relative), whilst a minor degradation of 1.5% absolute (4.5% relative) is observed on the matched condition.…”

Section: Single-condition Mapping Using SDM (mentioning)

confidence: 70%
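
The absolute and relative figures in the statement above can be checked with a few lines of arithmetic. This is a minimal sketch: the function name `wer_reduction` is hypothetical and not from the cited paper, and only the two WER values quoted in the statement (76.0% and 63.4%) are used.

```python
def wer_reduction(baseline_wer, new_wer):
    """Return (absolute, relative) WER reduction, both in percent.

    absolute: difference in WER percentage points.
    relative: the absolute difference as a percentage of the baseline WER.
    """
    absolute = baseline_wer - new_wer
    relative = 100.0 * absolute / baseline_wer
    return absolute, relative


# Values quoted in the citation statement: SDM test data, IHM-trained models.
absolute, relative = wer_reduction(76.0, 63.4)
print(f"absolute: {absolute:.1f} points, relative: {relative:.1f}%")
```

The ratio 12.6 / 76.0 is roughly 0.166, so the quoted 16.5% relative improvement agrees with the quoted absolute figure up to rounding.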

“…Our previous work showed that SDM system trained using alignment generated from IHM (clean) ASR system provided significantly better performance [22], compared to SDM system trained using alignment from SDM. Since SDM data are synchronized with IHM data (on a frame-level), the SDM models are trained using HMM state alignments generated for IHM recordings.…”

Section: Experimental Data and Setup (mentioning)

confidence: 99%

“…This paper considers a model-based combination of multiple microphones. Our previous work of single channel mapping in [22] is extended, and results are compared with conventional speech enhancement techniques for distant large vocabulary speech recognition. Further, we investigate fMLLR transform for speaker adaptation when it is used within the feature mapping framework, and show that the feature mapping is complementary to fMLLR feature space adaptation.…”

Section: Feature Mapping Techniques Using DNN (mentioning)

confidence: 99%

“…Compared to our previous work in [22], the first fMLLR transform is applied to the input features prior to training the first DNN. Hence, we use speaker-normalized distant-talking speech features as input for the mapping procedure.…”

Section: DNN-based Feature Mapping (mentioning)

confidence: 99%

“…The DNN is used to map the noisy and reverberant features to the BN-based features extracted from the close-talking input. Once the mapping is completed, the transformed BN features are extracted for training a new acoustic model [22]. The model-based combination of multiple microphones using the transformed BN features is proposed to integrate the multi-channel inputs for acoustic modeling.…”

Section: Introduction (mentioning)

confidence: 99%


Feature mapping using far-field microphones for distant speech recognition

Himawan; Motlíček; Sridharan

Speech Communication, 2016


Acoustic modeling based on deep architectures has recently gained remarkable success, with substantial improvement of speech recognition accuracy in several automatic speech recognition (ASR) tasks. For distant speech recognition, the multi-channel deep neural network based approaches rely on the powerful modeling capability of a deep neural network (DNN) to learn a suitable representation of distant speech directly from its multi-channel source. In this model-based combination of multiple microphones, features from each channel are concatenated and used together as input to a DNN. This allows integrating the multi-channel audio for acoustic modeling without any pre-processing steps. Despite the powerful modeling capabilities of DNNs, an environmental mismatch due to noise and reverberation may result in severe performance degradation when features are simply fed to a DNN without a feature enhancement step. In this paper, we introduce the nonlinear bottleneck feature mapping approach using a DNN, to transform the noisy and reverberant features to their clean versions. The bottleneck features trained on the clean signal are used as a teacher signal because they contain information relevant to phoneme classification, and the mapping is performed with the objective of suppressing noise and reverberation. The individual and combined impacts of beamforming and speaker adaptation techniques along with the feature mapping are examined for distant large vocabulary speech recognition, using single and multiple far-field microphones. As an alternative to beamforming, experiments with concatenating multiple channel features are conducted. The experimental results on the AMI meeting corpus show that the feature mapping, used in combination with beamforming and speaker adaptation, yields a distant speech recognition performance below 50% word error rate (WER), using a DNN for acoustic modeling.
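
The two ideas in this abstract, concatenating per-channel features and regressing them onto clean bottleneck targets, can be illustrated with a toy sketch. Everything here is an assumption for illustration only: the function names, the single tanh hidden layer, the mean-squared-error objective, and plain gradient descent are not taken from the paper, whose mapping network is deeper and whose targets come from a DNN trained on close-talking speech.

```python
import numpy as np


def concat_channels(channels):
    """Stack per-channel feature matrices (frames x dims) along the feature axis."""
    return np.concatenate(channels, axis=1)


def init_mapper(in_dim, hidden_dim, out_dim, seed=0):
    """Randomly initialise a one-hidden-layer regression network (hypothetical sizes)."""
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.normal(0.0, 0.1, (in_dim, hidden_dim)),
        "b1": np.zeros(hidden_dim),
        "W2": rng.normal(0.0, 0.1, (hidden_dim, out_dim)),
        "b2": np.zeros(out_dim),
    }


def forward(params, x):
    """Return the hidden activations and the predicted (mapped) features."""
    h = np.tanh(x @ params["W1"] + params["b1"])
    return h, h @ params["W2"] + params["b2"]


def mse(pred, target):
    """Mean squared error between mapped features and the teacher features."""
    return float(np.mean((pred - target) ** 2))


def train_step(params, x, target, lr=0.05):
    """One gradient-descent step on the MSE; returns the loss before the update."""
    h, pred = forward(params, x)
    loss = mse(pred, target)
    grad_out = 2.0 * (pred - target) / pred.size      # d(loss)/d(pred)
    grad_h = (grad_out @ params["W2"].T) * (1.0 - h ** 2)  # back through tanh
    params["W2"] -= lr * (h.T @ grad_out)
    params["b2"] -= lr * grad_out.sum(axis=0)
    params["W1"] -= lr * (x.T @ grad_h)
    params["b1"] -= lr * grad_h.sum(axis=0)
    return loss
```

In use, `x` would hold the concatenated distant-microphone features and `target` the frame-synchronous bottleneck features from the clean channel; after training, the output of `forward` would stand in for the mapped features passed on to acoustic model training.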
