Improved Spoken Language Representation for Intent Understanding in a Task-Oriented Dialogue System

Kim, June-Woo; Yoon, Hyekyung; Jung, Ho‐Young

doi:10.3390/s22041509

Cited by 3 publications

(1 citation statement)

References 32 publications


“…To make matters worse, since most of the benchmark curated speech datasets [2][3][4][5] are built with very limited diversity, mostly representing healthy adults, it is challenging to accurately recognize the speech of children [9,10], the elderly [11][12][13], or those using dialects. Consequently, speech recognition performance suffers in highly variable scenarios, such as far-field or noisy environments [6][7][8][14,15], where the conditions or personal characteristics [16][17][18][19] degrade the performance compared with normal speech. In addition, recognizing new or trending words is important for ASR systems, but updating already built end-to-end ASR systems every time is time-consuming and resource-intensive.…”

Section: Introduction (mentioning)

confidence: 99%

Unsupervised Representation Learning with Task-Agnostic Feature Masking for Robust End-to-End Speech Recognition

2023

Self Cite


Unsupervised learning-based approaches for training speech vector representations (SVR) have recently been widely applied. While pretrained SVR models excel in relatively clean automatic speech recognition (ASR) tasks, such as those recorded in laboratory environments, they are still insufficient for practical applications with various types of noise, intonation, and dialects. To cope with this problem, we present a novel unsupervised SVR learning method for practical end-to-end ASR models. Our approach involves designing a speech feature masking method to stabilize SVR model learning and improve the performance of the ASR model in a downstream task. By introducing a noise masking strategy into diverse combinations of the time and frequency regions of the spectrogram, the SVR model becomes a robust representation extractor for the ASR model in practical scenarios. In pretraining experiments, we train the SVR model using approximately 18,000 h of Korean speech datasets that included diverse speakers and were recorded in environments with various amounts of noise. The weights of the pretrained SVR extractor are then frozen, and the extracted speech representations are used for ASR model training in a downstream task. The experimental results show that the ASR model using our proposed SVR extractor significantly outperforms conventional methods.
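
The abstract describes noise masking over combinations of time and frequency regions of the spectrogram but gives no implementation details. The following is a minimal sketch of that general idea, not the authors' method: the function name `noise_mask_spectrogram`, the Gaussian noise, the single band per axis, and all width parameters are assumptions made for illustration.

```python
import numpy as np


def noise_mask_spectrogram(spec, rng, max_time_width=20, max_freq_width=8,
                           noise_std=1.0):
    """Overwrite one time band and one frequency band with Gaussian noise.

    spec is a 2-D array shaped (freq_bins, frames). All parameters are
    hypothetical; the source abstract does not specify them.
    """
    masked = spec.astype(float)  # astype returns a copy, input is untouched
    n_freq, n_time = masked.shape

    # Pick a random band of consecutive frames (time mask).
    t_width = min(int(rng.integers(1, max_time_width + 1)), n_time)
    t_start = int(rng.integers(0, n_time - t_width + 1))
    masked[:, t_start:t_start + t_width] = rng.normal(
        0.0, noise_std, size=(n_freq, t_width))

    # Pick a random band of consecutive frequency bins (frequency mask).
    f_width = min(int(rng.integers(1, max_freq_width + 1)), n_freq)
    f_start = int(rng.integers(0, n_freq - f_width + 1))
    masked[f_start:f_start + f_width, :] = rng.normal(
        0.0, noise_std, size=(f_width, n_time))

    return masked, (t_start, t_width), (f_start, f_width)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    spec = np.ones((80, 100))  # stand-in for a log-mel spectrogram
    masked, t_band, f_band = noise_mask_spectrogram(spec, rng)
    print("time band (start, width):", t_band)
    print("frequency band (start, width):", f_band)
```

Filling the masked bands with noise rather than zeros is one reading of "noise masking"; a zero-fill or mean-fill variant would change only the two assignment lines.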
