2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2014.6854059

UT-Vocal Effort II: Analysis and constrained-lexicon recognition of whispered speech

Abstract: This study focuses on acoustic variations in speech introduced by whispering, and proposes several strategies to improve robustness of automatic speech recognition of whispered speech with neutral-trained acoustic models. In the analysis part, differences in neutral and whispered speech captured in the UT-Vocal Effort II corpus are studied in terms of energy, spectral slope, and formant center frequency and bandwidth distributions in silence, voiced, and unvoiced speech signal segments. In the part dedicated t…
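
The analysis compares neutral and whispered speech through frame energy, spectral slope, and formant distributions. As a rough illustration of how such frame-level measures are commonly computed (a generic sketch, not the authors' exact procedure; the window choice, frame length, and regression-based slope estimate are assumptions):

import numpy as np

def frame_energy_and_spectral_slope(frame, sample_rate=16000):
    """Compute log energy and a regression-based spectral slope for one
    speech frame. Illustrative only -- not the paper's exact procedure."""
    # Hamming-window the frame before the FFT (common practice; an assumption here)
    windowed = frame * np.hamming(len(frame))
    # Log frame energy in dB
    energy_db = 10.0 * np.log10(np.sum(windowed ** 2) + 1e-12)
    # Magnitude spectrum in dB over the positive frequencies
    spectrum_db = 20.0 * np.log10(np.abs(np.fft.rfft(windowed)) + 1e-12)
    freqs = np.fft.rfftfreq(len(windowed), d=1.0 / sample_rate)
    # Spectral slope: least-squares line fit to the dB spectrum (dB/Hz),
    # skipping the DC bin. Whisper typically shows a flatter (less negative)
    # slope than voiced neutral speech, since the glottal roll-off is absent.
    slope, _ = np.polyfit(freqs[1:], spectrum_db[1:], deg=1)
    return energy_db, slope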

Cited by 24 publications (19 citation statements) · References 23 publications

“…As shown in the third column of Table 4, the recognizer reaches a phoneme error rate (PER) of 27.5% on the neutral task and 45.0% on the whisper task. The results agree with those in Ghaffarzadegan et al (2014b), where a whisper-trained recognizer tested on whisper data yielded a lower recognition rate than a neutral-trained model tested on neutral data. As discussed in Ghaffarzadegan et al (2014b), the main cause of this disparity may be the higher confusability of the whisper phone set: the voiced and unvoiced phone groups seen in neutral speech are all mapped to the unvoiced acoustic space, causing wide overlaps between originally voiced and unvoiced fricative and stop pairs.…”
Section: Baseline Experiments (supporting)
confidence: 82%
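
The PER figures quoted above are standard Levenshtein-alignment error rates over phone sequences. A minimal sketch of the computation (a generic illustration, not the cited recognizer's scoring tool):

def phoneme_error_rate(reference, hypothesis):
    """PER = (substitutions + deletions + insertions) / len(reference),
    computed via Levenshtein alignment of two phone sequences."""
    n, m = len(reference), len(hypothesis)
    # dist[i][j]: edit distance between reference[:i] and hypothesis[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i          # i deletions
    for j in range(m + 1):
        dist[0][j] = j          # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + sub)  # substitution/match
    return dist[n][m] / max(n, 1)

# Example: a voiced /z/ decoded as its unvoiced pair /s/, the kind of
# confusion the cited analysis highlights, counts as one substitution.
print(phoneme_error_rate(["z", "ih", "p"], ["s", "ih", "p"]))  # 0.333...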
“…This degradation is not as pronounced when an AM trained on whispered speech is used to recognize neutral speech. This result has been consistent across experiments in several languages for which a sizeable corpus of parallel whispered speech exists, such as Serbian [3], Japanese [4], Mandarin [5], and English [6,7]. The pattern also holds across model types: Gaussian Mixture Models (GMMs) trained with generative or discriminative methods, and even Deep Neural Network (DNN) based AMs.…”
Section: Related Work (supporting)
confidence: 81%
“…The use of Teager energy cepstral coefficients with a deep denoising autoencoder (DDA) has recently brought clear benefits in speaker-dependent (SD) neutral-trained whisper recognition [18]. Likewise, the performance of speaker-independent (SI) recognition of whispered speech has been significantly improved by adapting the acoustic model toward DDA-generated pseudo-whisper samples, compared to adapting the model on the available small whisper set (from the UT-Vocal Effort II speech corpus) [13,15]. However, to the best of our knowledge, a comparison of different speech recognition tools, including SVMs, for whispered speech recognition has not been reported.…”
Section: Related Work (mentioning)
confidence: 99%
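
The Teager energy operator underlying the cepstral coefficients cited in [18] is a simple nonlinear operator on the discrete signal. A minimal sketch follows; the full coefficient extraction and the denoising autoencoder stages are not reproduced, and the boundary handling is an assumption:

import numpy as np

def teager_energy(x):
    """Discrete Teager energy operator: psi[n] = x[n]^2 - x[n-1]*x[n+1].
    It tracks instantaneous signal energy and is the front end of the
    Teager energy cepstral coefficients referenced above. Assumes
    len(x) >= 3."""
    x = np.asarray(x, dtype=float)
    psi = np.empty_like(x)
    psi[1:-1] = x[1:-1] ** 2 - x[:-2] * x[2:]
    # Boundary samples lack a valid neighbor; mirror the nearest value
    # (an assumption made for this sketch).
    psi[0], psi[-1] = psi[1], psi[-2]
    return psi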