2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2018.8462663

Domain Adversarial Training for Accented Speech Recognition

Abstract: In this paper, we propose a domain adversarial training (DAT) algorithm to alleviate the accented speech recognition problem. In order to reduce the mismatch between labeled source domain data ("standard" accent) and unlabeled target domain data (with heavy accents), we augment the learning objective for a Kaldi TDNN network with a domain adversarial training (DAT) objective to encourage the model to learn accent-invariant features. In experiments with three Mandarin accents, we show that DAT yields up to 7.45%…
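
The abstract describes augmenting an acoustic model's training objective with a domain adversarial branch trained through gradient reversal. As a rough illustration of that mechanism, the sketch below shows a gradient reversal layer feeding an accent discriminator in PyTorch; the paper's actual system is a Kaldi TDNN, and every name here (GradReverse, AccentDiscriminator, lambda_) is an illustrative assumption, not the authors' code.

```python
import torch
import torch.nn as nn


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips and scales gradients in backward."""

    @staticmethod
    def forward(ctx, x, lambda_):
        ctx.lambda_ = lambda_
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # The reversed gradient pushes the shared feature extractor to
        # *confuse* the accent discriminator, encouraging accent-invariant
        # features, while the discriminator itself is trained normally.
        return -ctx.lambda_ * grad_output, None


class AccentDiscriminator(nn.Module):
    """Small classifier that predicts the accent/domain label from shared features."""

    def __init__(self, feat_dim, n_accents, lambda_=0.1):
        super().__init__()
        self.lambda_ = lambda_
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 256),
            nn.ReLU(),
            nn.Linear(256, n_accents),
        )

    def forward(self, features):
        # Gradient reversal sits between the acoustic feature extractor
        # and the accent classifier.
        reversed_feats = GradReverse.apply(features, self.lambda_)
        return self.net(reversed_feats)
```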

Cited by 93 publications (71 citation statements)
References 16 publications
“…approach in spirit, aiming to learn accent-invariant features through gradient reversal [16]. The gradient reversal approach keeps the modules of GAI, DAI and the ASR model in Fig.…”
Section: Baselines
confidence: 99%
“…In [26,27], the authors matched the distributions of clean and distorted speech in the feature space, and confirmed that noise-invariant features were beneficial to robust acoustic models. In [28][29][30][31], speaker-invariant and accent-invariant features were extracted in a similar fashion for speaker recognition and speech recognition. In [32], a Domain Separation Network with three network components was used to extract the features.…”
Section: Related Work
confidence: 99%
“…By doing so, it learns invariance to speaker characteristics. This approach has been previously used for speech recognition to learn invariance to noise conditions [9], speaker identity [10,11,12] and accent [13]. The authors in [11] proposed to minimize the senone classification loss and simultaneously maximize the speaker classification loss.…”
Section: Introduction
confidence: 99%
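
For concreteness, the snippet below sketches how the two objectives quoted above (minimizing the senone classification loss while maximizing the domain or speaker classification loss through a reversed gradient) can be combined into a single training loss. It assumes the GradReverse sketch shown earlier and placeholder modules (encoder, senone_head, domain_head); none of this is taken from the cited papers' implementations.

```python
import torch.nn.functional as F


def dat_step(encoder, senone_head, domain_head, feats, senone_targets, domain_targets):
    """One hypothetical DAT training step combining both objectives."""
    shared = encoder(feats)  # shared acoustic representation
    # Standard senone classification loss on labeled (source-domain) data.
    senone_loss = F.cross_entropy(senone_head(shared), senone_targets)
    # domain_head is assumed to start with a gradient reversal layer
    # (see the GradReverse sketch above), so simply adding its loss makes
    # the encoder *maximize* domain confusion while the domain classifier
    # itself is still minimized as usual.
    domain_loss = F.cross_entropy(domain_head(shared), domain_targets)
    return senone_loss + domain_loss
```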