Interspeech 2021
DOI: 10.21437/interspeech.2021-60

Spectro-Temporal Deep Features for Disordered Speech Assessment and Recognition


Cited by 11 publications (10 citation statements)
References 38 publications
“…In contrast, related previous works either: a) trained A2A inversion models on synthesized normal-speech acoustic-articulatory features before applying them to dysarthric speech [23], without accounting for the large mismatch between normal and impaired speech encountered during the inversion model training and articulatory feature generation stages; or b) considered only cross-domain or cross-corpus A2A inversion [25], without assessing the quality of the generated articulatory features on the back-end disordered speech recognition systems. In addition, the lowest published WER of 24.82% on the benchmark UASpeech task, compared against recent studies [8][9][10][11][12][13][37][38][39], was obtained using the proposed cross-domain acoustic-to-articulatory inversion approach.…”
Section: Introduction (mentioning)
confidence: 77%
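To make the cross-domain acoustic-to-articulatory (A2A) inversion referenced in the statement above concrete, here is a minimal, hedged sketch in PyTorch of an inversion network that maps acoustic feature frames to articulatory trajectories. The architecture, feature dimensions, and MSE objective are illustrative assumptions, not the cited systems' actual configuration.

```python
# Minimal A2A inversion sketch (illustrative assumptions, not the cited systems):
# map acoustic frames (e.g. 40-dim filterbanks) to articulatory trajectories
# (e.g. 12-dim EMA sensor coordinates) with a bidirectional LSTM.
import torch
import torch.nn as nn

class A2AInversion(nn.Module):
    def __init__(self, acoustic_dim=40, articulatory_dim=12, hidden=256):
        super().__init__()
        self.encoder = nn.LSTM(acoustic_dim, hidden, num_layers=2,
                               batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, articulatory_dim)

    def forward(self, acoustics):            # (batch, frames, acoustic_dim)
        hidden_states, _ = self.encoder(acoustics)
        return self.proj(hidden_states)      # (batch, frames, articulatory_dim)

# Frame-level MSE training on parallel acoustic-articulatory data; the generated
# trajectories can then be appended to dysarthric acoustic features as extra ASR inputs.
model = A2AInversion()
acoustics = torch.randn(4, 200, 40)          # dummy batch: 4 utterances, 200 frames
predicted = model(acoustics)
loss = nn.functional.mse_loss(predicted, torch.randn(4, 200, 12))
```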
“…Their difficulty in using keyboard, mouse and touch-screen based user interfaces makes speech-controlled assistive technologies more natural alternatives [7], even though speech quality is degraded. Despite the rapid progress of automatic speech recognition (ASR) technologies over the past few decades, recognition of disordered speech remains a very challenging task due to the severe mismatch against normal speech, the difficulty of large-scale data collection for system development, and the high level of variability among speakers [8][9][10][11][12][13].…”
Section: Introduction (mentioning)
confidence: 99%
“…Kim et al., 2016; Seong et al., 2016) and a deep learning model (Xiong et al., 2018). To overcome the limitations of dysarthric speech as training data, researchers (1) used models that require less training data (Gemmeke et al., 2014), (2) augmented data by artificially generating dysarthric speech (Green et al., 2021; Jin et al., 2021; Ko et al., 2017; Liu et al., 2021; Mariya Celin et al., 2020; Vachhani et al., 2018; Xiong et al., 2019), and (3) adapted data to a given speaker (Geng et al., 2021; Takashima et al., 2020). Further, Sriranjani et al. (2015) used "data pooling," in which normal speech recordings were pooled from databases and combined with dysarthric speech data to train systems.…”
Section: RQ1: How the Characteristics of Users' Speech Affect Their I... (mentioning)
confidence: 99%
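The augmentation strategy in (2) above typically simulates dysarthric characteristics, such as a slower speaking rate, from normal speech. The sketch below illustrates that idea with simple tempo perturbation using librosa; the perturbation factors and file names are hypothetical, and this is not the exact procedure of the cited works.

```python
# Illustrative tempo-perturbation augmentation (an assumption about one common
# strategy, not the cited works' exact method): slow down normal speech to
# mimic the reduced speaking rate often seen in dysarthric speech.
import librosa
import soundfile as sf

def tempo_perturb(in_wav, out_wav, rate=0.7):
    """Time-stretch an utterance without changing pitch (rate < 1 = slower)."""
    audio, sr = librosa.load(in_wav, sr=None)
    stretched = librosa.effects.time_stretch(audio, rate=rate)
    sf.write(out_wav, stretched, sr)

# Hypothetical usage: several slowed-down copies of one control recording.
for r in (0.9, 0.8, 0.7):
    tempo_perturb("control_utt.wav", f"augmented_tempo{r}.wav", rate=r)
```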
“…To this end, this paper investigates a novel set of techniques to incorporate speech impairment severity into state-of-the-art hybrid DNN [13], end-to-end Conformer [37] and self-supervised learning (SSL) based pre-trained Wav2vec 2.0 [38] ASR systems. These include the use of: a) multi-task [39] training cost interpolation between the ASR loss and speech impairment severity prediction error; b) spectral basis embedding (SBE) [11,25] based speaker-severity aware adaptation features that are trained to simultaneously predict both speaker-identity and impairment severity; and c) structured learning hidden units contribution (LHUC) [40] transforms that are separately conditioned on speaker-identity and impairment severity. These are used to facilitate both speaker-severity adaptive training of ASR systems and their test-time unsupervised adaptation to both factors of variability.…”
Section: Introduction (mentioning)
confidence: 99%
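The multi-task training cost interpolation described in a) above amounts to a weighted sum of an ASR objective and a speech impairment severity prediction loss. Below is a minimal PyTorch sketch under that reading; the encoder, the frame-level cross-entropy stand-in for the ASR criterion, and the interpolation weight are assumptions for illustration, not the cited paper's implementation.

```python
# Sketch of multi-task cost interpolation: L = (1 - lambda) * L_asr + lambda * L_severity.
# All dimensions and the simple cross-entropy ASR stand-in are illustrative assumptions.
import torch
import torch.nn as nn

class SeverityAwareASR(nn.Module):
    def __init__(self, feat_dim=80, vocab_size=1000, severity_levels=4, hidden=512):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=3, batch_first=True)
        self.asr_head = nn.Linear(hidden, vocab_size)          # per-frame token posteriors
        self.severity_head = nn.Linear(hidden, severity_levels)

    def forward(self, feats):                                   # (batch, frames, feat_dim)
        enc, _ = self.encoder(feats)
        asr_logits = self.asr_head(enc)                         # (batch, frames, vocab)
        severity_logits = self.severity_head(enc.mean(dim=1))   # utterance-level severity
        return asr_logits, severity_logits

def multitask_loss(asr_logits, asr_targets, severity_logits, severity_labels, lam=0.2):
    # Interpolate the ASR loss with the severity prediction error.
    asr_loss = nn.functional.cross_entropy(
        asr_logits.reshape(-1, asr_logits.size(-1)), asr_targets.reshape(-1))
    severity_loss = nn.functional.cross_entropy(severity_logits, severity_labels)
    return (1 - lam) * asr_loss + lam * severity_loss
```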