2019
DOI: 10.48550/arxiv.1904.05862
Preprint

wav2vec: Unsupervised Pre-training for Speech Recognition

Cited by 151 publications (230 citation statements)
References 0 publications
“…However, the performance of the acoustic model can be further improved by deploying more robust input features than MFCC. In the final section, we evaluate the proposed method trained on noise-invariant Wav2Vec features [34]. The Wav2Vec representation has been trained on large amounts of unlabeled audio data in an unsupervised manner.…”
Section: Experiments and Results
Citation type: mentioning, confidence: 99%
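
The statement above describes replacing MFCC inputs with representations extracted from a pre-trained wav2vec model. A minimal sketch of that feature-extraction step follows; it uses torchaudio's pretrained wav2vec 2.0 bundle as a stand-in (the original wav2vec checkpoints were distributed with fairseq), and the input file name is a placeholder assumption, not the cited paper's setup.

```python
# Minimal sketch: extracting pretrained speech representations to replace
# MFCC features. The bundle choice (wav2vec 2.0 via torchaudio) and the
# audio path are illustrative assumptions.
import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_BASE  # pretrained, no fine-tuning
model = bundle.get_model().eval()

waveform, sr = torchaudio.load("utterance.wav")  # hypothetical input file
waveform = waveform.mean(dim=0, keepdim=True)    # mix down to mono
if sr != bundle.sample_rate:
    waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)

with torch.inference_mode():
    features, _ = model.extract_features(waveform)

# `features` is a list of per-layer tensors of shape (batch, frames, dim);
# a chosen layer (often the last) can feed a downstream acoustic model
# in place of MFCCs.
print(len(features), features[-1].shape)
```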
“…recognition (Chen et al., 2020), and automatic speech recognition (Schneider et al., 2019; Baevski et al., 2019). The Wav2vec 2.0 model (Baevski et al., 2020) is an end-to-end self-supervised learning framework for automatic speech recognition (ASR), and it has recently been presented as an effective pre-training method for learning speech representations.…”
Section: Introduction
Citation type: mentioning, confidence: 99%
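
The statement above presents wav2vec 2.0 as a self-supervised pre-training method for end-to-end ASR. Below is a minimal sketch of inference with a fine-tuned wav2vec 2.0 model and greedy CTC decoding; the torchaudio bundle and the audio path are illustrative assumptions, not the cited papers' exact configuration.

```python
# Minimal sketch: ASR with a wav2vec 2.0 model fine-tuned for CTC,
# followed by greedy CTC decoding. Bundle and input path are assumptions.
import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H  # fine-tuned for ASR
model = bundle.get_model().eval()
labels = bundle.get_labels()  # first label "-" is the CTC blank, "|" is space

waveform, sr = torchaudio.load("utterance.wav")  # hypothetical input file
waveform = waveform.mean(dim=0, keepdim=True)    # mix down to mono
if sr != bundle.sample_rate:
    waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)

with torch.inference_mode():
    emissions, _ = model(waveform)  # (batch, frames, vocab) log-probs

# Greedy CTC decode: argmax per frame, collapse repeats, drop blanks.
ids = emissions[0].argmax(dim=-1).tolist()
tokens, prev = [], None
for i in ids:
    if i != prev and labels[i] != "-":
        tokens.append(labels[i])
    prev = i
print("".join(tokens).replace("|", " ").strip())
```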
“…Considering the complex ATC environment, handcrafted feature engineering may not be an optimal option for ASR tasks. Therefore, learning mechanisms such as SincNet and wav2vec [7], [8] were proposed to learn informative and discriminative features from raw waveforms, achieving the desired performance improvements for common ASR applications.…”
Section: Introduction
Citation type: mentioning, confidence: 99%