Interspeech 2020
DOI: 10.21437/interspeech.2020-2282

Exploiting Cross-Domain Visual Feature Generation for Disordered Speech Recognition

Cited by 22 publications (19 citation statements)
References 22 publications
“…By incorporating all the above techniques, the best recognition system produced an overall word error rate (WER) of 25.21% on the 22.6-hour UASpeech test set containing 16 dysarthric speakers. To the best of our knowledge, this is the lowest WER published so far on the same task reported in the literature [8]-[10], [40], [45], [47]. An overall WER reduction of 5.39% absolute (17.61% relative) was obtained over the CUHK 2018 system featuring a 6-way DNN system combination [10] which defined state-of-the-art performance at the time.…”
Section: Corpus
Type: mentioning
confidence: 78%
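
The absolute and relative reductions quoted above are related by simple arithmetic, which the following minimal sketch checks. The baseline WER of 30.60% is not stated in the quoted passage; it is derived here by adding the quoted absolute reduction to the quoted new WER, and the function name `relative_reduction` is purely illustrative.

```python
def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction, in percent, of new_wer over baseline_wer."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Figures quoted in the citation statement.
new_wer = 25.21            # overall WER of the best system (%)
absolute_reduction = 5.39  # absolute WER reduction (%)

# Assumption: the baseline is derived from the two quoted figures,
# it does not appear in the quoted passage itself.
baseline_wer = new_wer + absolute_reduction

rel = relative_reduction(baseline_wer, new_wer)
print(f"baseline {baseline_wer:.2f}%, relative reduction {rel:.2f}%")
# prints "baseline 30.60%, relative reduction 17.61%"
```

The result agrees with the 17.61% relative figure in the quote, so the two reported reductions are mutually consistent.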
“…Lastly, inspired by the bi-modal nature of human speech perception and the success of audio-visual speech recognition (AVSR) technologies when being applied to normal speech [42]-[44], visual information is further incorporated to improve disordered speech recognition performance. In order to address the data sparsity that arises from the difficulty to record large amounts of high quality audio-visual (AV) data, a cross-domain visual feature generation approach [45] was developed. High quality AV parallel data based on normal speech recording of the lip reading sentence (LRS2) dataset [46] was used to build neural AV inversion systems.…”
Section: Corpus
Type: mentioning
confidence: 99%
“…In spite of the swift progress of automatic speech recognition (ASR) technologies targeting normal speech in the past few decades [1][2][3][4][5][6][7][8][9], accurate recognition of disordered speech remains a demanding task to date [10][11][12][13][14][15][16]. The underlying causes of speech disorders include a wide range of neuro-motor conditions, such as cerebral palsy, Parkinson disease, amyotrophic lateral sclerosis and stroke or traumatic brain injuries [17].…”
Section: Introduction
Type: mentioning
confidence: 99%