2022
DOI: 10.1109/taslp.2022.3195113
Speaker Adaptation Using Spectro-Temporal Deep Features for Dysarthric and Elderly Speech Recognition

Abstract: The application of data-intensive automatic speech recognition (ASR) technologies to dysarthric and elderly adult speech is confronted by their mismatch against healthy and non-aged voices, data scarcity, and large speaker-level variability. To this end, this paper proposes two novel data-efficient methods to learn homogeneous dysarthric and elderly speaker-level features for rapid, on-the-fly test-time adaptation of DNN/TDNN and Conformer ASR models. These include: 1) speaker-level variance-regularized spectral …
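The abstract's first method (truncated above) involves speaker-level variance-regularized spectral features. As a rough illustration only — not the paper's implementation — a speaker-level feature can be formed by pooling frame-level spectral statistics across a speaker's utterances with a variance floor for regularization; the pooling scheme, dimensionality, and `eps` floor below are all assumptions:

```python
import numpy as np

def speaker_level_spectral_feature(spectrograms, eps=1e-8):
    """Pool per-utterance (frames, bins) log-spectrograms into one
    speaker-level vector: concatenated mean and variance-floored
    standard deviation over all frames of that speaker."""
    frames = np.concatenate(spectrograms, axis=0)   # (total_frames, bins)
    mean = frames.mean(axis=0)
    std = np.sqrt(frames.var(axis=0) + eps)         # eps acts as a variance floor
    return np.concatenate([mean, std])

# toy example: two utterances from one speaker, 40 spectral bins
rng = np.random.default_rng(0)
utts = [rng.standard_normal((120, 40)), rng.standard_normal((80, 40))]
feat = speaker_level_spectral_feature(utts)
print(feat.shape)  # (80,)
```

Such a pooled vector is homogeneous per speaker by construction, which is what makes it usable for on-the-fly test-time adaptation without per-speaker retraining.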

Cited by 13 publications (12 citation statements).
References 130 publications (183 reference statements).
“…The underlying variability of dysarthric speech, manifesting in changes of spectral envelope, volume reduction, imprecise articulation, and breathy or hoarse voices, can be modeled by disentangling the speech spectrum into time-invariant and time-variant subspaces [11,30] learned in a supervised manner [11,25]. The resulting spectral basis deep embedding (SBE) features are more effective in encoding latent attributes of impaired speech [25] than classical speaker embeddings such as iVectors [44] and xVectors [45]. Hence, we adopt SBE features as speaker-severity aware auxiliary inputs to the hybrid DNN systems to incorporate both speaker identity and speech impairment severity into the acoustic front-ends.…”
Section: Speaker-Severity Aware Auxiliary Features
Confidence: 99%
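The disentanglement described in the statement above — splitting a spectrogram into a time-invariant spectral basis and time-variant activations — can be illustrated with an unsupervised truncated-SVD stand-in. Note this is only a sketch of the subspace idea: the cited SBE features are learned in a supervised manner, and the rank `k` and the SVD substitute here are assumptions.

```python
import numpy as np

def decompose_spectrum(log_spec, k=4):
    """Split a (frames, bins) log-spectrogram into a time-invariant
    spectral basis (bins, k) and time-variant activation
    trajectories (frames, k) via truncated SVD."""
    U, s, Vt = np.linalg.svd(log_spec, full_matrices=False)
    basis = Vt[:k].T * s[:k]     # time-invariant spectral subspace, scaled
    activations = U[:, :k]       # time-variant temporal trajectories
    return basis, activations

rng = np.random.default_rng(1)
spec = rng.standard_normal((200, 64))     # 200 frames, 64 spectral bins
basis, act = decompose_spectrum(spec, k=4)
print(basis.shape, act.shape)  # (64, 4) (200, 4)
```

In this framing, a speaker-level embedding would summarize the time-invariant basis (which captures stable speaker and severity attributes), while the activations carry the utterance-specific temporal content.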
“…For example, dysarthric speakers of very low speech intelligibility exhibit clearer patterns of articulatory imprecision, decreased volume and clarity, increased dysfluencies, slower speaking rate, and changes in pitch [29], while those diagnosed with mid or high speech intelligibility are closer to normal speakers. Such heterogeneity further increases the mismatch against normal speech and the difficulty in both speaker-independent (SI) ASR system development using limited impaired speech data and fine-grained personalization to individual users' data [3,25,30]. So far, the majority of prior research addressing dysarthric speaker-level diversity has focused on using speaker identity only, either in speaker-dependent (SD) data augmentation [7,9,13,14,18,27] or in speaker-adapted or speaker-dependent ASR system development [1, 3, 4, 7, 11-13, 19, 22, 25, 31-33]. In contrast, very little prior research has used speech impairment severity information for dysarthric speech recognition.…”
Section: Introduction
Confidence: 99%