Automatic speech recognition is becoming more ubiquitous as recognition performance improves, capable devices increase in number, and new areas of application open up. Neural network acoustic models that can utilize speaker-adaptive features, have deep and wide layers, or use more computationally expensive architectures, for example, often obtain the best recognition accuracy but may not fit the computational, storage, or latency budget of the deployed system. We explore a straightforward training approach which takes advantage of highly accurate but expensive-to-evaluate neural network acoustic models by using their outputs to relabel training examples for easier-to-deploy models. Experiments on a large vocabulary continuous speech recognition task yield relative reductions in word error rate of up to 16.7% over training with the hard aligned labels, by effectively making use of large amounts of additional untranscribed data. Somewhat remarkably, the approach works well even when only two output classes are present. Experiments on a voice activity detection task give relative reductions in equal error rate of up to 11.5% when using a convolutional neural network to relabel training examples for a feedforward neural network. An investigation into the hidden layer weight matrices finds that soft target-trained networks tend to produce weight matrices having fuller rank and slower decay in singular values than their hard target-trained counterparts, suggesting that more of the network's capacity is used to learn additional information, yielding better accuracy.
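The relabeling approach described above amounts to soft-target (distillation-style) training: a frozen, expensive teacher model scores acoustic frames, and a smaller student model is trained against the teacher's output posteriors rather than hard aligned labels, so untranscribed data can also be used. The following is a minimal PyTorch sketch under assumed architectures and feature dimensions; all names and sizes are illustrative, not from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical models: an expensive teacher and a deployable student.
teacher = nn.Sequential(nn.Linear(40, 2048), nn.ReLU(),
                        nn.Linear(2048, 2048), nn.ReLU(),
                        nn.Linear(2048, 500))
student = nn.Sequential(nn.Linear(40, 256), nn.ReLU(),
                        nn.Linear(256, 500))

optimizer = torch.optim.SGD(student.parameters(), lr=0.1)

def distill_step(features):
    """One training step on a batch of acoustic feature vectors.

    No transcripts are needed: the teacher's posteriors serve as
    soft targets for the student.
    """
    with torch.no_grad():                      # teacher is frozen
        soft_targets = F.softmax(teacher(features), dim=-1)
    log_probs = F.log_softmax(student(features), dim=-1)
    # Cross-entropy against soft targets (equivalent to KL divergence
    # up to a constant given by the teacher's entropy).
    loss = -(soft_targets * log_probs).sum(dim=-1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example: a batch of 32 frames of 40-dimensional features.
print(distill_step(torch.randn(32, 40)))
```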
This paper describes speaker-independent word recognition based on a new neural network model (Dynamic Programming Neural Network; DNN), which can handle time-sequence patterns. The proposed model, DNN, integrates a multilayer neural network with dynamic-programming-based matching. Speaker-independent isolated Japanese digit recognition experiments were carried out using data uttered by 107 speakers (50 speakers for training and 57 for testing). A recognition accuracy of 99.3% was obtained, suggesting that the proposed model can be effective for speech recognition.
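For context, the dynamic-programming matching that this model builds on is the classic DTW alignment of two feature sequences. Below is a minimal NumPy sketch of that component only; in the proposed model the local distances would come from network outputs rather than the Euclidean distance assumed here, and all data in the example are synthetic.

```python
import numpy as np

def dtw_distance(seq_a, seq_b):
    """Dynamic time warping distance between two (T, D) sequences."""
    n, m = len(seq_a), len(seq_b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])
            # Standard step pattern: match, insertion, or deletion.
            cost[i, j] = d + min(cost[i - 1, j - 1],
                                 cost[i - 1, j],
                                 cost[i, j - 1])
    return cost[n, m]

# Example: classify a test utterance by its nearest reference template.
rng = np.random.default_rng(0)
test = rng.normal(size=(50, 12))
templates = {"zero": rng.normal(size=(45, 12)),
             "one": rng.normal(size=(55, 12))}
print(min(templates, key=lambda w: dtw_distance(test, templates[w])))
```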