Phone recognition with hierarchical convolutional deep maxout networks

Tóth, László

doi:10.1186/s13636-015-0068-3

Cited by 75 publications

(45 citation statements)

References 37 publications

“…The results we obtain, although comparable with the state-of-the-art in semi-supervised learning, are not comparable with the current state-of-the-art in phone recognition on the TIMIT database which is 16.5% Phone Error Rate (or, equivalently, 83.5% accuracy) [38]. The reason for this is twofold: Firstly, the above results only use a fraction of the labels provided by the TIMIT database for training.…”

Section: Discussion (mentioning)

confidence: 66%

Sparse Autoencoder Based Semi-Supervised Learning for Phone Classification with Limited Annotations

Dhaka, Salvi

2017

GLU 2017 International Workshop on Grounding Language Understanding

We propose the application of a semi-supervised learning method to improve the performance of acoustic modelling for automatic speech recognition with limited linguistically annotated material. Our method combines sparse autoencoders with feed-forward networks, thus taking advantage of both unlabelled and labelled data simultaneously through mini-batch stochastic gradient descent. We tested the method with varying proportions of labelled vs unlabelled observations in frame-based phoneme classification on the TIMIT database. Our experiments show that the method outperforms standard supervised models of similar complexity for an equal amount of labelled data and provides competitive error rates compared to state-of-the-art graph-based semi-supervised learning techniques.

“…To our knowledge, the best performing method is based on hierarchical convolutional deep maxout networks and achieves 16.5% Phone Error Rate (or, equivalently, 83.5% accuracy) [38].…”

Section: Results (mentioning)

confidence: 99%

Sparse Autoencoder Based Semi-Supervised Learning for Phone Classification with Limited Annotations

Dhaka, Salvi

2017

GLU 2017 International Workshop on Grounding Language Understanding

“…Decoding and evaluation was performed by applying a modified version of HTK [27]. We employed our custom neural network implementation, which achieved outstanding results earlier on several datasets (eg [29,30]). Following preliminary tests, we opted for five hidden layers, each one containing 1000 rectified neurons, and we applied the softmax activation function in the output layer.…”

Section: Methods (mentioning)

confidence: 99%

Domain Adaptation of Deep Neural Networks for Automatic Speech Recognition via Wireless Sensors

Gosztolya

Grósz

2016

Journal of Electrical Engineering

Wireless sensors are recent, portable, low-powered devices, designed to record and transmit observations of their environment such as speech. To allow portability they are designed to have a small size and weight; this, however, along with their low power consumption, usually means that they have only quite basic recording equipment (e.g. microphone) installed. Recent speech technology applications typically require several dozen hours of audio recordings (nowadays even hundreds of hours is common), which is usually not available as recorded material by such sensors. Since systems trained with studio-level utterances tend to perform suboptimally for such recordings, a sensible idea is to adapt models which were trained on existing, larger, noise-free corpora. In this study, we experimented with adapting Deep Neural Network-based acoustic models trained on noise-free speech data to perform speech recognition on utterances recorded by wireless sensors. In the end, we were able to achieve a 5% gain in terms of relative error reduction compared to training only on the sensor-recorded, restricted utterance subset.

“…Lately, ASR systems have become much more accurate and robust thanks to deep neural networks (DNNs) [16][17][18]. We used scripts provided with the Kaldi toolkit [19] for training DNN-based ASR systems and the IRSTLM tool [20] for building language models.…”

Section: ASR System Development Details (mentioning)

confidence: 99%

Developing a unit selection voice given audio without corresponding text

Godambe

Rallabandi

Gangashetty

et al. 2016

J AUDIO SPEECH MUSIC PROC.

Today, a large amount of audio data is available on the web in the form of audiobooks, podcasts, video lectures, video blogs, news bulletins, etc. In addition, we can effortlessly record and store audio data such as a read, lecture, or impromptu speech on handheld devices. These data are rich in prosody and provide a plethora of voices to choose from, and their availability can significantly reduce the overhead of data preparation and help rapid building of synthetic voices. But a few problems are associated with readily using this data, such as (1) these audio files are generally long, and audio-transcription alignment is memory intensive; (2) precise corresponding transcriptions are unavailable; (3) many times, no transcriptions are available at all; (4) the audio may contain disfluencies and non-speech noises, since they are not specifically recorded for building synthetic voices; and (5) if we obtain automatic transcripts, they will not be error free. Earlier works on long audio alignment addressing the first and second issue generally preferred reasonable transcripts and mainly focused on (1) less manual intervention, (2) mispronunciation detection, and (3) segmentation error recovery. In this work, we use a large vocabulary public domain automatic speech recognition (ASR) system to obtain transcripts, followed by confidence measure-based data pruning which together address the five issues with the found data and also ensure the above three points. For proof of concept, we build voices in the English language using an audiobook (read speech) in a female voice from LibriVox and a lecture (spontaneous speech) in a male voice from Coursera, using both reference and hypotheses transcriptions, and evaluate them in terms of intelligibility and naturalness with the help of a perceptual listening test on the Blizzard 2013 corpus.
