2013 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU)
DOI: 10.1109/asru.2013.6707745

Deep maxout neural networks for speech recognition

Abstract: A recently introduced type of neural network called maxout has worked well in many domains. In this paper, we propose to apply maxout to acoustic models in speech recognition. The maxout neuron picks the maximum value within a group of linear pieces as its activation. This nonlinearity is a generalization of the rectified nonlinearity and has the ability to approximate any form of activation function. We apply maxout networks to the Switchboard phone-call transcription task and evaluate the performances unde…
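The abstract describes the maxout unit precisely enough to sketch: each unit computes several affine "pieces" and outputs their maximum. The NumPy snippet below is a minimal illustration, not code from the paper; the layer sizes and the group size k are assumed for the example.

```python
import numpy as np

def maxout_layer(x, W, b, k):
    """Maxout layer: one affine map, then a max over groups of k linear pieces.

    x: input vector, shape (d_in,)
    W: weights, shape (d_in, d_out * k)
    b: biases, shape (d_out * k,)
    """
    z = x @ W + b            # all linear pieces, shape (d_out * k,)
    z = z.reshape(-1, k)     # group the pieces, shape (d_out, k)
    return z.max(axis=1)     # each unit emits the max of its k pieces

# Illustrative sizes (not taken from the paper): 440 inputs, 512 units, k = 2
rng = np.random.default_rng(0)
d_in, d_out, k = 440, 512, 2
W = rng.standard_normal((d_in, d_out * k)) * 0.01
b = np.zeros(d_out * k)
h = maxout_layer(rng.standard_normal(d_in), W, b, k)
print(h.shape)  # (512,)
```

With k = 2 and one of the two pieces pinned to the zero function, the unit computes max(z, 0), which is the sense in which maxout generalizes the rectifier.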

Cited by 67 publications (30 citation statements)
References 12 publications
“…MNNs, instead of making a prior assumption about the parametric form of the non-linearity, try to build it automatically from a number of linear components. While this work was under review, two additional papers were published on maxout activation for ASR [25,26]. As a result, the contributions of this work overlap to some extent with one or the other, and we will refer to those in the text when necessary.…”
Section: Introduction
confidence: 99%
“…This activation function can be regarded as a generalization of the rectifier function [16], and so far, only a few studies have attempted to apply maxout networks to speech recognition tasks. These all found that maxout nets slightly outperformed ReLU networks, in particular under low-resource conditions [17][18][19]. Here, we show that the pooling procedure applied in CNNs and the pooling step of the maxout function are practically the same, and hence, it is trivial to combine the two techniques and construct convolutional networks out of maxout neurons.…”
Section: Introduction
confidence: 92%
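The equivalence this excerpt points out can be made concrete: taking a max over groups of convolutional feature maps at each position is the same operation as the maxout activation. The sketch below is a minimal 1-D illustration under assumed sizes, not an implementation from [17][18][19].

```python
import numpy as np

def conv1d_valid(x, kernels):
    """Valid 1-D convolution over time: x (T, d), kernels (n_maps, w, d)."""
    n_maps, w, d = kernels.shape
    T_out = x.shape[0] - w + 1
    out = np.empty((T_out, n_maps))
    for t in range(T_out):
        # each map: dot product of its (w, d) kernel with the current window
        out[t] = np.tensordot(kernels, x[t:t + w], axes=([1, 2], [0, 1]))
    return out

def maxout_over_maps(feature_maps, k):
    """Maxout as cross-map pooling: max over groups of k feature maps per frame."""
    T, n_maps = feature_maps.shape
    return feature_maps.reshape(T, n_maps // k, k).max(axis=2)

# Illustrative sizes (not from the cited papers)
rng = np.random.default_rng(0)
x = rng.standard_normal((20, 40))                # 20 frames of 40 filterbank features
kernels = rng.standard_normal((8, 5, 40)) * 0.1  # 8 feature maps, filter width 5
h = maxout_over_maps(conv1d_valid(x, kernels), k=2)
print(h.shape)  # (16, 4): 4 convolutional maxout units per output frame
```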
“…In our experiments with p-norm pooling, we set p to 2, following Zhang et al., and in our first tests the group size was set to 2, which was found to be optimal for maxout networks [17][18][19]. Our tests quickly revealed that our p-norm implementation faces difficulties with propagating the error back to lower layers.…”
Section: Improvements to Maxout
confidence: 99%
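For reference, the p-norm unit replaces the max over a group with the group's p-norm, recovering maxout in the limit p → ∞. The sketch below uses p = 2 and group size 2 as in the excerpt, but it is an assumed formulation rather than a reproduction of Zhang et al.; the eps guard in the backward pass marks the spot (a group norm of zero) where the error propagation the authors mention becomes numerically delicate.

```python
import numpy as np

def pnorm_units(z, k, p=2.0):
    """p-norm pooling over groups of k pieces: y_i = (sum_j |z_ij|^p)^(1/p)."""
    groups = z.reshape(-1, k)
    return (np.abs(groups) ** p).sum(axis=1) ** (1.0 / p)

def pnorm_backward(z, k, p=2.0, eps=1e-8):
    """dy/dz_ij = sign(z_ij) * |z_ij|^(p-1) / y_i^(p-1).

    The eps guard handles y_i = 0, where the gradient is undefined.
    """
    groups = z.reshape(-1, k)
    y = (np.abs(groups) ** p).sum(axis=1) ** (1.0 / p)
    grad = np.sign(groups) * np.abs(groups) ** (p - 1) / (y[:, None] + eps) ** (p - 1)
    return grad.reshape(z.shape)

z = np.array([3.0, -4.0, 1.0, 0.0])
print(pnorm_units(z, k=2))     # [5. 1.]  -- the 2-norm of each pair
print(pnorm_backward(z, k=2))  # [ 0.6 -0.8  1.  0. ] (approximately)
```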
“…For the English model, a DNN with 5 hidden layers and a softmax output layer with 41 units was trained on 700 hours of English data from the Switchboard and Fisher datasets. More details about the DNN model training can be found in [11].…”
Section: System Description
confidence: 99%
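Only the depth (5 hidden layers) and the 41-unit softmax output are fixed by this excerpt; the sketch below fills in the remaining choices (input size, hidden width, ReLU hidden activations, random weights) with assumptions, purely to show the shape of such a forward pass.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def dnn_forward(x, weights, biases):
    """Forward pass through the hidden layers, then a softmax output layer."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.maximum(h @ W + b, 0.0)  # assumed ReLU hidden activation
    return softmax(h @ weights[-1] + biases[-1])

# Assumed sizes: 440-dim input, five 1024-unit hidden layers, 41 outputs
rng = np.random.default_rng(0)
sizes = [440, 1024, 1024, 1024, 1024, 1024, 41]
weights = [rng.standard_normal((a, b)) * 0.01 for a, b in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(b) for b in sizes[1:]]
posteriors = dnn_forward(rng.standard_normal(440), weights, biases)
print(posteriors.shape, round(posteriors.sum(), 6))  # (41,) 1.0
```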