Interspeech 2017
DOI: 10.21437/Interspeech.2017-1385

Semi-Supervised DNN Training with Word Selection for ASR

Abstract: Not all questions related to the semi-supervised training of hybrid ASR systems with DNN acoustic models have been deeply investigated yet. In this paper, we focus on the question of the granularity of confidences (per-sentence, per-word, per-frame) and on the question of how the data should be used (data selection by masks, or mini-batch SGD with confidences as weights). We then propose to re-tune the system with the manually transcribed data, both with 'frame CE' training and with 'sMBR' training. Our preferred …
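As a rough illustration of the two data-usage options named in the abstract, here is a minimal PyTorch-style sketch, assuming per-frame confidences from a seed decoder; all names and thresholds are hypothetical, not the authors' code:

import torch
import torch.nn.functional as F

def weighted_frame_ce(logits, targets, conf, threshold=None):
    """Frame-level CE over automatically transcribed data.

    conf: per-frame confidences in [0, 1] from the seed decoder.
    threshold=None -> confidences used directly as weights in mini-batch SGD;
    threshold=t    -> 0/1 masking, i.e. data selection by masks.
    Both branches are illustrative assumptions.
    """
    per_frame = F.cross_entropy(logits, targets, reduction="none")  # shape (T,)
    w = (conf >= threshold).float() if threshold is not None else conf
    return (w * per_frame).sum() / w.sum().clamp(min=1e-8)

# toy usage: T frames, K senone classes
T, K = 8, 10
logits = torch.randn(T, K)
targets = torch.randint(0, K, (T,))
conf = torch.rand(T)
loss_soft = weighted_frame_ce(logits, targets, conf)        # confidences as weights
loss_mask = weighted_frame_ce(logits, targets, conf, 0.7)   # 0/1 mask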

Cited by 21 publications (24 citation statements: 1 supporting, 23 mentioning, 0 contrasting); citing publications span 2018–2024. References 21 publications.

“…Then, we retrained the model with an additional embedding layer g(·) using the semi-supervised loss with paired si84 and unpaired si284, following Algorithm 1. As seen in [8]–[10], we observed that retraining always gives better results than training from random weights. We searched for the best hyperparameters α, β ∈ [0.5, 0.9] on the dev93 set.…”
Section: Settings (supporting)
confidence: 62%
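A minimal sketch of the hyperparameter search described in this excerpt, assuming a simple weighted combination of the paired and unpaired loss terms (the excerpt does not spell out the exact loss form, and evaluate_dev93 is a stub standing in for a dev93 WER evaluation):

import itertools

def evaluate_dev93(alpha, beta):
    # Placeholder: in the real setup this would train with an assumed
    # combined loss L = alpha * L_paired(si84) + beta * L_unpaired(si284)
    # and return dev93 WER; stubbed here with a dummy error surface.
    return abs(alpha - 0.7) + abs(beta - 0.6)

grid = [0.5, 0.6, 0.7, 0.8, 0.9]  # alpha, beta in [0.5, 0.9]
alpha, beta = min(itertools.product(grid, grid),
                  key=lambda ab: evaluate_dev93(*ab))
print(alpha, beta)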
“…According to [5], careful transcription costs 20 hours of human effort for each hour of speech. To reduce this effort, many researchers have developed semi-supervised training methods for ASR systems [6]–[10], since large amounts of unpaired data can be obtained easily.…”
Section: Introduction (mentioning)
confidence: 99%
“…In our scenario we consider adding the 'CC-untran' data (untranscribed data from the target domain = contact centers), or the 'Parl' data (imperfectly transcribed parliament data from a different domain). We train either with 'masking' (scaling gradients in NN training with 0/1 per-frame weights) [7], or with 're-segmentation' (selecting sub-segments with reliable transcripts) [6]. We begin by constructing a seed system, which we use for decoding automatic transcripts and filtering the imperfect transcripts.…”
Section: Methods (mentioning)
confidence: 99%
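A minimal sketch of the 're-segmentation' strategy mentioned in this excerpt, assuming (word, start, end, confidence) tuples from the seed decoder; the thresholds are illustrative, not values from [6]:

def select_subsegments(words, min_conf=0.9, min_words=3):
    """Keep maximal runs of consecutive words whose confidence is at
    least min_conf, dropping runs shorter than min_words.
    words: list of (word, start_s, end_s, conf) tuples."""
    segments, run = [], []
    for w in words:
        if w[3] >= min_conf:
            run.append(w)
        else:
            if len(run) >= min_words:
                segments.append(run)
            run = []
    if len(run) >= min_words:
        segments.append(run)
    # each kept run becomes a new training segment (start, end, transcript)
    return [(r[0][1], r[-1][2], " ".join(w[0] for w in r)) for r in segments]

hyp = [("hello", 0.0, 0.4, 0.98), ("wrld", 0.4, 0.7, 0.41),
       ("how", 0.7, 0.9, 0.95), ("are", 0.9, 1.1, 0.97), ("you", 1.1, 1.3, 0.99)]
print(select_subsegments(hyp))  # -> [(0.7, 1.3, 'how are you')]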
“…With untranscribed data, the situation is more difficult. The data can be used in semi-supervised training [7, 3, 8, 9, 10, 11, 12, 13]. Here, we need to identify the most reliable parts of the automatically generated transcripts to include in the training.…”
Section: Introduction (mentioning)
confidence: 99%
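As a sketch of such reliability selection, assuming word-level confidences with time alignments, per-word decisions can be mapped to per-frame training weights that feed a masked loss like the one after the abstract; the frame shift and threshold are assumptions for illustration:

def frame_weights_from_words(words, n_frames, frame_shift=0.01, min_conf=0.9):
    """Map per-word confidences to per-frame 0/1 weights: frames inside
    words with confidence >= min_conf get weight 1, all others 0.
    words: list of (word, start_s, end_s, conf) tuples."""
    w = [0.0] * n_frames
    for word, start_s, end_s, conf in words:
        if conf >= min_conf:
            lo = int(start_s / frame_shift)
            hi = min(n_frames, int(end_s / frame_shift) + 1)
            for t in range(lo, hi):
                w[t] = 1.0
    return w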