Unsupervised Representation Learning with Task-Agnostic Feature Masking for Robust End-to-End Speech Recognition

Kim, June-Woo; Chung, Hoon; Jung, Ho‐Young

doi:10.3390/math11030622

Cited by 2 publications

(3 citation statements)

References 44 publications

“…v) On-the-fly VR-SBE plus FBK+VR-SBE driven f-LHUC adaptation (Sys. 12) not only outperforms the comparable baselines replacing VR-SBE with iVector or xVector (Sys. [10][11]), but also gives further WER reductions over VR-SBE adaptation alone by 0.64% abs. (2.03% rel.)…”

Section: Performance Analyses (mentioning)

confidence: 96%

“…significant improvement (α = 0.05) obtained over iVector (Sys. 2,10,14,22,27), xVector (Sys. 3,11,15,23,28), or both.…”

Section: A. Experiments on Dysarthric Speech (mentioning)

confidence: 99%

“…to 10 ms (Sys. [3][4][5][6][7][8][9][10]). This suggests that VR-SBE features extracted on the fly can instantaneously capture homogeneous characteristics of dysarthric speakers.…”

Section: A. Experiments on Dysarthric Speech (mentioning)

confidence: 99%

Speaker Adaptation Using Spectro-Temporal Deep Features for Dysarthric and Elderly Speech Recognition

Xie, Wang, et al. 2022

IEEE/ACM Trans. Audio Speech Lang. Process.

The application of data-intensive automatic speech recognition (ASR) technologies to dysarthric and elderly adult speech is confronted by their mismatch against healthy and non-aged voices, data scarcity and large speaker-level variability. To this end, this paper proposes two novel data-efficient methods to learn homogeneous dysarthric and elderly speaker-level features for rapid, on-the-fly test-time adaptation of DNN/TDNN and Conformer ASR models. These include: 1) speaker-level variance-regularized spectral basis embedding (VR-SBE) features that exploit a special regularization term to enforce homogeneity of speaker features in adaptation; and 2) feature-based learning hidden unit contributions (f-LHUC) transforms that are conditioned on VR-SBE features. Experiments are conducted on four tasks across two languages: the English UASpeech and TORGO dysarthric speech datasets, and the English DementiaBank Pitt and Cantonese JCCOCC MoCA elderly speech corpora. The proposed on-the-fly speaker adaptation techniques consistently outperform baseline iVector and xVector adaptation by statistically significant word or character error rate reductions up to 5.32% absolute (18.57% relative) and batch-mode LHUC speaker adaptation by 2.24% absolute (9.20% relative), while operating with real-time factors speeding up to 33.6 times against xVectors during adaptation. The efficacy of the proposed adaptation techniques is demonstrated in a comparison against current ASR technologies including SSL pre-trained systems on UASpeech, where our best system produces a state-of-the-art WER of 23.33%. Analyses show VR-SBE features and f-LHUC transforms are insensitive to speaker-level data quantity in test-time adaptation. t-SNE visualization reveals they have stronger speaker-level homogeneity than baseline iVectors, xVectors and batch-mode LHUC transforms.
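The abstract reports each gain as both an absolute and a relative error-rate reduction. The short sketch below shows only how the two figures relate arithmetically; the function names are hypothetical, and the "implied baseline" is a back-calculation from the rounded numbers quoted above, not a value reported in the abstract (the "up to" figures may also come from different systems or tasks).

```python
def relative_reduction(baseline_wer: float, absolute_reduction: float) -> float:
    """Relative reduction in percent, given a baseline error rate (%)
    and an absolute reduction in percentage points."""
    return 100.0 * absolute_reduction / baseline_wer


def implied_baseline(absolute_reduction: float, relative_pct: float) -> float:
    """Baseline error rate (%) implied by a pair of absolute and
    relative reductions. Illustrative back-calculation only."""
    return 100.0 * absolute_reduction / relative_pct


# Pairs quoted in the abstract: (absolute points, relative percent).
vs_xvector = implied_baseline(5.32, 18.57)
vs_batch_lhuc = implied_baseline(2.24, 9.20)

print(f"implied baseline for 5.32 abs / 18.57 rel: {vs_xvector:.2f}%")
print(f"implied baseline for 2.24 abs /  9.20 rel: {vs_batch_lhuc:.2f}%")
```

An absolute reduction is a difference in percentage points, while the relative reduction divides that difference by the baseline error rate, which is why the same absolute gain looks larger on a system that already has a low error rate.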

Spectral Salt-and-Pepper Patch Masking for Self-Supervised Speech Representation Learning

Kim, Chung, Jung 2023

Mathematics

Recent advanced systems in the speech recognition domain use large Transformer neural networks that have been pretrained on massive speech data. General methods in the deep learning area have been frequently shared across various domains, and the Transformer model can also be used effectively across speech and image. In this paper, we introduce a novel masking method for self-supervised speech representation learning using a salt-and-pepper (S&P) mask, which is commonly used in computer vision. The proposed scheme includes consecutive quadrilateral-shaped S&P patches that randomly contaminate the input speech spectrum. Furthermore, we modify the standard S&P mask to make it appropriate for the speech domain. To validate the effect of the proposed spectral S&P patch masking for the self-supervised representation learning approach, we conduct pretraining and downstream experiments with two languages, English and Korean. To this end, we pretrain the speech representation model on each dataset and evaluate the pretrained models for feature extraction and fine-tuning performance on varying downstream tasks. The experimental outcomes clearly illustrate that the proposed spectral S&P patch masking is effective for various downstream tasks when combined with conventional masking methods.
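As a rough illustration of the idea described in this abstract, the sketch below corrupts random rectangular patches of a spectrogram with salt-and-pepper noise. It is a minimal sketch under stated assumptions, not the authors' method: the function name `sp_patch_mask`, the patch-size limits, the `density` parameter, and the choice of the spectrogram's minimum and maximum as pepper and salt values are all hypothetical, and the speech-specific modification mentioned in the abstract is not reproduced here.

```python
import numpy as np


def sp_patch_mask(spec, num_patches=2, max_freq=8, max_time=10,
                  density=0.5, rng=None):
    """Return a copy of `spec` (freq x time) with rectangular patches
    corrupted by salt-and-pepper noise. All parameters are illustrative
    assumptions, not values from the paper."""
    rng = np.random.default_rng() if rng is None else rng
    out = spec.copy()
    n_freq, n_time = spec.shape
    pepper, salt = spec.min(), spec.max()  # assumed noise values

    for _ in range(num_patches):
        # Sample a patch size and a top-left corner that fit in the input.
        f = int(rng.integers(1, min(max_freq, n_freq) + 1))
        t = int(rng.integers(1, min(max_time, n_time) + 1))
        f0 = int(rng.integers(0, n_freq - f + 1))
        t0 = int(rng.integers(0, n_time - t + 1))

        # Inside the patch, roughly density/2 of the cells become pepper
        # and density/2 become salt; the rest are left unchanged.
        noise = rng.random((f, t))
        patch = out[f0:f0 + f, t0:t0 + t]  # view into `out`
        patch[noise < density / 2] = pepper
        patch[noise > 1 - density / 2] = salt

    return out


rng = np.random.default_rng(0)
spec = rng.normal(size=(40, 100))  # stand-in for a log-mel spectrogram
masked = sp_patch_mask(spec, rng=rng)
```

The abstract notes that the approach works when combined with conventional masking methods, so in practice a step like this would be applied alongside other masking rather than on its own.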
