A hierarchical, context-dependent neural network architecture for improved phone recognition

Tóth, László

doi:10.1109/icassp.2011.5947489

Cited by 9 publications

(19 citation statements)

References 12 publications


“…The decision tree-based state clustering tool of HTK produced 858 tied states, and evaluating the phone models in forced alignment mode yielded the training targets for each frame of speech. The tied state set we applied here is the same as that used in our earlier study [25].…”

Section: Baseline Results on TIMIT (mentioning)

confidence: 99%

“…We found earlier that applying CD states as the network training targets is useful even for such a small corpus as TIMIT [25]. Hence, in all the experiments reported here, we used a tied state set that was obtained by training a conventional CD-HMM (using HTK).…”

Section: Baseline Results on TIMIT (mentioning)

confidence: 99%

“…With the bottleneck approach, we successfully trained a hierarchical system with context-dependent state targets [25]. Another possible way of improving hierarchical systems is to downsample the output of the lower network [46].…”

Section: Hierarchical Modelling (mentioning)

confidence: 99%

“…Some authors observed that the posterior estimates obtained can be "enhanced" by training yet another network-but this time on a sequence of output vectors coming from the first network [22]. Other authors refer to this approach as the "hierarchical modeling" [23][24][25] or the "stacked modeling" method [26]. Two trivial improvements to this approach are when the upper net downsamples the output of the lower one [24,27] and/or when it uses the output of some bottleneck layer instead of the uppermost softmax layer [25,26].…”

Section: Introduction (mentioning)

confidence: 99%
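
The hierarchical scheme described in the statement above, in which an upper network is trained on a sequence of output vectors from a lower network, can be illustrated with a small sketch. This is only an illustration under stated assumptions, not the cited authors' implementation: the function name `stack_context`, the clamping at utterance edges, and the toy two-class outputs are all hypothetical. Downsampling is read here as taking every second frame of the lower network's output over a wider window.

```python
def stack_context(posteriors, offsets):
    """Build upper-network input vectors from lower-network outputs.

    posteriors: list of per-frame output vectors (lists of floats).
    offsets: frame offsets relative to the current frame, e.g. (-1, 0, 1).
    Frames outside the utterance are clamped to the first or last frame
    (an assumption made for this sketch).
    """
    n = len(posteriors)
    stacked = []
    for t in range(n):
        row = []
        for off in offsets:
            idx = min(max(t + off, 0), n - 1)  # clamp to a valid frame index
            row.extend(posteriors[idx])
        stacked.append(row)
    return stacked


# Toy lower-network outputs: 5 frames, 2 classes each (made-up values).
lower_out = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.1, 0.9], [0.2, 0.8]]

# Dense context: three neighbouring frames.
dense = stack_context(lower_out, offsets=(-1, 0, 1))

# Downsampled context: same number of inputs, but spanning five frames.
sparse = stack_context(lower_out, offsets=(-2, 0, 2))
```

Both variants give the upper network inputs of the same size; the downsampled one covers a longer time span per input vector. Using a bottleneck layer instead of the softmax output would only change what `posteriors` contains.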

“…Other authors refer to this approach as the "hierarchical modeling" [23][24][25] or the "stacked modeling" method [26]. Two trivial improvements to this approach are when the upper net downsamples the output of the lower one [24,27] and/or when it uses the output of some bottleneck layer instead of the uppermost softmax layer [25,26]. Veselý's proposal was to treat this hierarchical construct as one joint model, and he also explained why the compound structure can be interpreted as a deep convolutional network [21].…”

Section: Introduction (mentioning)

confidence: 99%


Phone recognition with hierarchical convolutional deep maxout networks

Tóth

2015

J AUDIO SPEECH MUSIC PROC.


Deep convolutional neural networks (CNNs) have recently been shown to outperform fully connected deep neural networks (DNNs) both on low-resource and on large-scale speech tasks. Experiments indicate that convolutional networks can attain a 10-15 % relative improvement in the word error rate of large vocabulary recognition tasks over fully connected deep networks. Here, we explore some refinements to CNNs that have not been pursued by other authors. First, the CNN papers published up till now used sigmoid or rectified linear (ReLU) neurons. We will experiment with the maxout activation function proposed recently, which has been shown to outperform the rectifier activation function in fully connected DNNs. We will show that the pooling operation of CNNs and the maxout function are closely related, and so the two technologies can be readily combined to build convolutional maxout networks. Second, we propose to turn the CNN into a hierarchical model. The origins of this approach go back to the era of shallow nets, where the idea of stacking two networks on each other was relatively well known. We will extend this method by fusing the two networks into one joint deep model with many hidden layers and a special structure. We will show that with the hierarchical modelling approach, we can reduce the error rate of the network on an expanded context of input. In the experiments on the Texas Instruments Massachusetts Institute of Technology (TIMIT) phone recognition task, we find that a CNN built from maxout units yields a relative phone error rate reduction of about 4.3 % over ReLU CNNs. Applying the hierarchical modelling scheme to this CNN results in a further relative phone error rate reduction of 5.5 %. Using dropout training, the lowest error rate we get on TIMIT is 16.5 %, which is currently the best result. 
Besides experimenting on TIMIT, we also evaluate our best models on a low-resource large vocabulary task, and we find that all the proposed modelling improvements give consistently better results for this larger database as well.
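
The abstract's remark that CNN pooling and the maxout function are closely related can be made concrete with a minimal sketch. It assumes the standard definitions: maxout keeps the maximum over a group of linear unit outputs, and max pooling keeps the maximum over adjacent positions of one feature map. The function names and the toy values are hypothetical, and this is not the paper's code.

```python
def maxout(pre_activations, group_size):
    """Maxout: split linear unit outputs into groups, keep each group's max."""
    if len(pre_activations) % group_size != 0:
        raise ValueError("number of units must be a multiple of group_size")
    return [
        max(pre_activations[i:i + group_size])
        for i in range(0, len(pre_activations), group_size)
    ]


def max_pool(feature_map, pool_size):
    """Non-overlapping max pooling along one axis of a single feature map."""
    return [
        max(feature_map[i:i + pool_size])
        for i in range(0, len(feature_map) - pool_size + 1, pool_size)
    ]


# Maxout over 4 linear units grouped in pairs (made-up values).
units = [1.0, -2.0, 0.5, 3.0]
maxout_out = maxout(units, group_size=2)

# Max pooling over 4 adjacent positions of one filter's output.
positions = [0.2, 0.9, 0.4, 0.1]
pooled_out = max_pool(positions, pool_size=2)
```

In both cases the operation is a maximum over a small group of values; the difference lies only in what is grouped (different units at one position versus one filter at neighbouring positions).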

