2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2014.6853589

Improving deep neural network acoustic models using generalized maxout networks

Abstract: Recently, maxout networks have brought significant improvements to various speech recognition and computer vision tasks. In this paper we introduce two new types of generalized maxout units, which we call p-norm and soft-maxout. We investigate their performance in Large Vocabulary Continuous Speech Recognition (LVCSR) tasks in various languages with 10 hours and 60 hours of data, and find that the p-norm generalization of maxout consistently performs well. Because, in our training setup, we sometimes see insta…
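The p-norm unit named in the abstract replaces maxout's group-wise max with a group-wise p-norm of the linear activations, y = (Σ_i |x_i|^p)^(1/p). Below is a minimal NumPy sketch of that idea; the group size and the value of p are illustrative placeholders, not the settings reported in the paper.

```python
import numpy as np

def pnorm_unit(x, group_size=10, p=2.0):
    """p-norm generalization of maxout: each output is the p-norm of a
    group of linear activations, y_j = (sum_i |x_i|^p)^(1/p).
    group_size and p are illustrative values, not the paper's tuned settings."""
    x = np.asarray(x, dtype=float)
    assert x.size % group_size == 0, "input dim must be a multiple of group_size"
    groups = x.reshape(-1, group_size)          # one row per output unit
    return np.power(np.sum(np.abs(groups) ** p, axis=1), 1.0 / p)

# Example: 20 linear activations reduced to 2 p-norm outputs (p = 2)
h = np.random.randn(20)
print(pnorm_unit(h, group_size=10, p=2.0))
```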

Citations: cited by 240 publications (144 citation statements)
References: 15 publications
“…We use the Kaldi speech recognition tools (Povey et al., 2011) to build our Spanish ASR systems. Our state-of-the-art ASR system is the p-norm DNN system of (Zhang et al., 2014). The word error rates on the dev and test sets of the Fisher dataset (dev, dev-2, test) are 29.80%, 29.79% and 25.30% respectively.…”
Section: Results (mentioning)
confidence: 99%
“…For the language model we used a pruned version of the standard trigram language model that is distributed with the WSJ corpus. The acoustic models, referred to as SAT in the tables, are speaker-adapted GMM models [18,19], and those referred to as DNN are based on deep neural networks with p-norm non-linearities [23], trained and tested on top of fMLLR features. The models estimated on LibriSpeech's training data are named after the amount of audio they were built on.…”
Section: Methods (mentioning)
confidence: 99%
“…Variations of ReLU, such as leaky ReLU [41], parametric ReLU [42], and exponential LU [43], have also been explored for improved accuracy. Finally, a non-linearity called maxout, which takes the max value of two intersecting linear functions, has been shown to be effective in speech recognition tasks [44,45].…”
Section: A. Convolutional Neural Networks (CNNs) (mentioning)
confidence: 99%
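As the excerpt above describes, a maxout unit outputs the maximum over a small group of linear activations; with groups of two this is exactly the "max of two intersecting linear functions". A minimal NumPy sketch of that non-linearity, using an illustrative group size of two:

```python
import numpy as np

def maxout_unit(x, group_size=2):
    """Maxout non-linearity: each output is the maximum over a group of
    linear activations. With group_size=2 this matches the 'max of two
    intersecting linear functions' description in the excerpt above."""
    x = np.asarray(x, dtype=float)
    assert x.size % group_size == 0, "input dim must be a multiple of group_size"
    groups = x.reshape(-1, group_size)   # one row per output unit
    return groups.max(axis=1)

# Example: 6 linear activations reduced to 3 maxout outputs
print(maxout_unit(np.array([0.3, -1.2, 2.0, 0.5, -0.1, -0.4])))
```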