Interspeech 2021
DOI: 10.21437/interspeech.2021-168
Adversarial Data Augmentation for Disordered Speech Recognition

Abstract: Automatic recognition of disordered speech remains a highly challenging task to date. The underlying neuro-motor conditions, often compounded with co-occurring physical disabilities, lead to the difficulty in collecting large quantities of impaired speech required for ASR system development. This paper presents novel variational auto-encoder generative adversarial network (VAE-GAN) based personalized disordered speech augmentation approaches that simultaneously learn to encode, generate and discriminate synthe…

Cited by 22 publications (20 citation statements); References 36 publications.
“…In contrast, related previous works either: a) trained A2A inversion models on synthesized normal speech acoustic-articulatory features before applying them to dysarthric speech [23], without accounting for the large mismatch between normal and impaired speech encountered during the inversion model training and articulatory feature generation stages; or b) considered only cross-domain or cross-corpus A2A inversion [25], without assessing the quality of the generated articulatory features using back-end disordered speech recognition systems. In addition, the lowest published WER of 24.82% on the benchmark UASpeech task, in comparison against recent research [8][9][10][11][12][13][37][38][39], was obtained using the proposed cross-domain acoustic-to-articulatory inversion approach.…”
Section: Introduction (mentioning)
confidence: 77%
“…Data augmentation techniques play a vital role in addressing the data sparsity problem in current disordered speech recognition systems [37,38]. Spectral-temporal perturbation of the limited audio data collected from impaired speakers is normally used to inject more diversity into the augmented data and improve the resulting ASR system's generalization on the same task, for example, the TORGO corpus.…”
Section: Acoustic-to-Articulatory Inversion, 3.1 In-Domain A2A Inversion (mentioning)
confidence: 99%
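The signal-level speed perturbation referenced in the statement above is typically realized by resampling the waveform by a small factor (e.g. 0.9 or 1.1). A minimal NumPy sketch, with an illustrative function name and a simple linear-interpolation resampler rather than any specific paper's implementation:

```python
import numpy as np

def speed_perturb(signal, factor):
    """Resample a 1-D waveform by `factor`, changing both tempo and
    pitch -- the standard signal-level speed perturbation.
    factor > 1 shortens the signal; factor < 1 lengthens it."""
    n_out = int(round(len(signal) / factor))
    # fractional sample positions in the original signal
    idx = np.linspace(0, len(signal) - 1, n_out)
    return np.interp(idx, np.arange(len(signal)), signal)

# one second of a 5 Hz sine at a 16 kHz sampling rate
sig = np.sin(2 * np.pi * 5 * np.linspace(0, 1, 16000))
slow = speed_perturb(sig, 0.9)   # longer, slower version
fast = speed_perturb(sig, 1.1)   # shorter, faster version
```

Augmentation pipelines usually apply a small set of such factors to each impaired-speaker utterance, multiplying the effective training data.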
“…In contrast, existing data augmentation studies for dysarthric and elderly speech mainly focus on signal-level tempo or speed perturbation based methods [14], [15], [33], [69], [70]. The only previous research on GAN-based dysarthric speech augmentation required the explicit use of parallel speech data from the UASpeech corpus [109], [110]. The non-parallel GAN-based normal-to-pathological voice conversion approach studied in [111] is evaluated on naturalness and severity, but not in terms of the performance of ASR systems constructed using the generated data.…”
Section: Corpus (mentioning)
confidence: 99%
“…The overall architecture configurations of the proposed DC-GAN model follow our previous work [110]. A flattening operation is applied to concatenate the outputs of the convolutional layers, resulting in a 3000-dimensional vector.…”
Section: A. DCGAN Model Architecture (mentioning)
confidence: 99%
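The flatten-and-concatenate step described in the quote can be illustrated in NumPy. The feature-map shapes below are hypothetical, chosen only so that the result is 3000-dimensional as the quote states; the excerpt does not give the actual DC-GAN layer configuration:

```python
import numpy as np

# Hypothetical outputs of the final convolutional layers
# (channels, height, width); shapes are assumptions for illustration.
conv_outputs = [
    np.zeros((4, 25, 10)),   # 4 maps of 25x10 -> 1000 values
    np.zeros((8, 25, 10)),   # 8 maps of 25x10 -> 2000 values
]

# Flatten each feature map and concatenate into one vector,
# as described for the DC-GAN model's flattening operation.
flat = np.concatenate([o.reshape(-1) for o in conv_outputs])
```

Flattening to a fixed-length vector like this is what allows the subsequent fully connected layers to operate on the convolutional features.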