“…In natural language processing (NLP), unsupervised pre-training of language models (Devlin et al., 2018; Radford et al., 2018) improved many tasks such as text classification, phrase structure parsing and machine translation (Lample & Conneau, 2019). In speech processing, pre-training has focused on emotion recognition (Lian et al., 2018), speaker identification, phoneme discrimination (Synnaeve & Dupoux, 2016a; van den Oord et al., 2018), as well as transferring ASR representations from one language to another (Kunze et al., 2017). There has been work on unsupervised learning for speech, but the resulting representations have not been applied to improve supervised speech recognition (Synnaeve & Dupoux, 2016b; Kamper et al., 2017; Chung et al., 2018; Chen et al., 2018; Chorowski et al., 2019).…”