ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp39728.2021.9414291
REDAT: Accent-Invariant Representation for End-To-End ASR by Domain Adversarial Training with Relabeling

Abstract: Accent mismatch is a critical problem for end-to-end ASR. This paper aims to address this problem by building an accent-robust RNN-T system with domain adversarial training (DAT). We unveil the magic behind DAT and provide, for the first time, a theoretical guarantee that DAT learns accent-invariant representations. We also prove that performing the gradient reversal in DAT is equivalent to minimizing the Jensen-Shannon divergence between domain output distributions. Motivated by the proof of equivalence, w…
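The equivalence stated in the abstract ties gradient reversal to the Jensen-Shannon (JS) divergence between the domain classifier's output distributions for different accents. As a quick, self-contained illustration of that quantity (a NumPy sketch with illustrative function names, not the paper's code): the JS divergence is zero exactly when the two domain distributions coincide, i.e. when the representation is accent-invariant.

```python
import numpy as np

def kl(p, q):
    # Kullback-Leibler divergence KL(p || q) for discrete distributions.
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0  # terms with p_i = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def js(p, q):
    # Jensen-Shannon divergence: symmetric, bounded above by log 2.
    m = 0.5 * (np.asarray(p, dtype=float) + np.asarray(q, dtype=float))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = [0.5, 0.3, 0.2]
print(js(p, p))                     # 0.0 -- identical domain distributions
print(js([1.0, 0.0], [0.0, 1.0]))  # log 2 -- fully separable domains
```

Minimizing this divergence (as DAT implicitly does, per the paper's proof) drives the two domain output distributions toward the accent-invariant case.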

Cited by 20 publications (9 citation statements)
References 25 publications (36 reference statements)
“…The second approach often employs methods such as domain adversarial training and transfer learning in order to utilize as much available accented speech data as possible. Domain adversarial training (DAT) is a popular approach as it encourages models to learn accent-invariant features [47,19,21]. Transfer learning is another popular approach in L2 speech recognition, as it possibly allows a model to gain knowledge from both the base task and the new task, even when the new task has limited data [34,8,45].…”
Section: Related Work
confidence: 99%
“…There have been many attempts to improve the recognition of accented speech, with varying degrees of success [7,8,9,10,11]. Some promising approaches include unsupervised adaptation [12,13], multitask learning with accent embeddings [14,15], and domain adversarial training [2,16]. While most approaches have delivered results, they either use massive amounts of accent data (e.g., 23K hours [2]), rely on corpora that are not publicly available [2,3], or use increasingly complex models [10,14,16] that do not shed light on how humans adapt so quickly to new accents.…”
Section: Related Work
confidence: 99%
“…Unlike noise, an accent is an intrinsic, speaker-dependent quality of speech, and humans are capable of understanding a novel accent within one minute of exposure [1]. However, machines require hundreds or even thousands of hours of speech data to get good performance [2,3]. This paper seeks to explore techniques inspired by human learning that go beyond merely gathering massive amounts of additional data to improve word error rate (WER) for accented speech recognition.…”
Section: Introduction
confidence: 99%
“…With this approach, it is believed that the output representations of the feature extractor can be domain-invariant, so the downstream model can achieve comparable results in both source and target domains. [13][14][15][16] trained automatic speech recognition models to deal with accented speech with DAT. [17] proposed to train a multi-lingual speech emotion recognition model with adversarial domain adaptation.…”
Section: Introduction
confidence: 99%
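The gradient-reversal mechanism that these citing works rely on can be sketched in a few lines (a hypothetical, framework-free illustration, not any cited system's implementation): the layer acts as the identity in the forward pass, while in the backward pass it negates, and optionally scales by a weight lam, the gradient flowing from the domain classifier back into the feature extractor, so the extractor is trained to confuse the classifier rather than help it.

```python
class GradReverse:
    """Gradient reversal layer: identity forward, sign-flipped gradient backward."""

    def __init__(self, lam=1.0):
        self.lam = lam  # trade-off between the ASR loss and domain confusion

    def forward(self, x):
        # Features pass through unchanged to the domain classifier.
        return x

    def backward(self, grad_out):
        # The domain classifier's gradient is negated (and scaled), so the
        # feature extractor ascends, rather than descends, the domain loss.
        return -self.lam * grad_out

grl = GradReverse(lam=0.5)
print(grl.forward(3.0))   # 3.0  (unchanged)
print(grl.backward(2.0))  # -1.0 (reversed and scaled)
```

In a real autograd framework this would be registered as a custom backward function on the path between the encoder and the domain classifier; the sketch above only isolates the sign flip that makes the training adversarial.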