ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp39728.2021.9414291
REDAT: Accent-Invariant Representation for End-To-End ASR by Domain Adversarial Training with Relabeling

Abstract: Accent mismatch is a critical problem for end-to-end ASR. This paper aims to address this problem by building an accent-robust RNN-T system with domain adversarial training (DAT). We unveil the magic behind DAT and provide, for the first time, a theoretical guarantee that DAT learns accent-invariant representations. We also prove that performing the gradient reversal in DAT is equivalent to minimizing the Jensen-Shannon divergence between domain output distributions. Motivated by the proof of equivalence, w…
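The equivalence stated in the abstract ties gradient reversal to the Jensen-Shannon (JS) divergence between the domain classifier's output distributions for different accents. As a quick, self-contained illustration of that quantity (a NumPy sketch with illustrative function names, not the paper's code): the JS divergence is zero exactly when the two domain distributions coincide, i.e. when the representation is accent-invariant.

```python
import numpy as np

def kl(p, q):
    # Kullback-Leibler divergence KL(p || q) for discrete distributions.
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0  # terms with p_i = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def js(p, q):
    # Jensen-Shannon divergence: symmetric, bounded above by log 2.
    m = 0.5 * (np.asarray(p, dtype=float) + np.asarray(q, dtype=float))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = [0.5, 0.3, 0.2]
print(js(p, p))                     # 0.0 -- identical domain distributions
print(js([1.0, 0.0], [0.0, 1.0]))  # log 2 -- fully separable domains
```

Minimizing this divergence (as DAT implicitly does, per the paper's proof) drives the two domain output distributions toward the accent-invariant case.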

Cited by 20 publications (9 citation statements)
References 25 publications (36 reference statements)
“…The second approach often employs methods such as domain adversarial training and transfer learning in order to utilize as much available accented speech data as possible. Domain adversarial training (DAT) is a popular approach as it encourages models to learn accent-invariant features [47,19,21]. Transfer learning is another popular approach in L2 speech recognition, as it possibly allows a model to gain knowledge from both the base task and the new task, even when the new task has limited data [34,8,45].…”
Section: Related Work
confidence: 99%
“…There have been many attempts to improve the recognition of accented speech, with varying degrees of success [7,8,9,10,11]. Some promising approaches include unsupervised adaptation [12,13], multitask learning with accent embeddings [14,15], and domain adversarial training [2,16]. While most approaches have delivered results, they either use massive amounts of accent data (e.g., 23K hours [2]), rely on corpora that are not publicly available [2,3], or use increasingly complex models [10,14,16] that do not shed light on how humans adapt so quickly to new accents.…”
Section: Related Work
confidence: 99%
“…Unlike noise, an accent is an intrinsic, speaker-dependent quality of speech, and humans are capable of understanding a novel accent within one minute of exposure [1]. However, machines require hundreds or even thousands of hours of speech data to get good performance [2,3]. This paper seeks to explore techniques inspired by human learning that go beyond merely gathering massive amounts of additional data to improve word error rate (WER) for accented speech recognition.…”
Section: Introduction
confidence: 99%
“…With this approach, it is believed that the output representations of the feature extractor can be domain-invariant, so the downstream model can achieve comparable results in both source and target domains. [13][14][15][16] trained automatic speech recognition models to deal with accented speech with DAT. [17] proposed to train a multi-lingual speech emotion recognition model with adversarial domain adaptation.…”
Section: Introduction
confidence: 99%
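The gradient-reversal mechanism that these citing works rely on can be sketched in a few lines (a hypothetical, framework-free illustration, not any cited system's implementation): the layer acts as the identity in the forward pass, while in the backward pass it negates, and optionally scales by a weight lam, the gradient flowing from the domain classifier back into the feature extractor, so the extractor is trained to confuse the classifier rather than help it.

```python
class GradReverse:
    """Gradient reversal layer: identity forward, sign-flipped gradient backward."""

    def __init__(self, lam=1.0):
        self.lam = lam  # trade-off between the ASR loss and domain confusion

    def forward(self, x):
        # Features pass through unchanged to the domain classifier.
        return x

    def backward(self, grad_out):
        # The domain classifier's gradient is negated (and scaled), so the
        # feature extractor ascends, rather than descends, the domain loss.
        return -self.lam * grad_out

grl = GradReverse(lam=0.5)
print(grl.forward(3.0))   # 3.0  (unchanged)
print(grl.backward(2.0))  # -1.0 (reversed and scaled)
```

In a real autograd framework this would be registered as a custom backward function on the path between the encoder and the domain classifier; the sketch above only isolates the sign flip that makes the training adversarial.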