Interspeech 2017
DOI: 10.21437/interspeech.2017-363

Effects of Training Data Variety in Generating Glottal Pulses from Acoustic Features with DNNs

Abstract: Glottal volume velocity waveform, the acoustical excitation of voiced speech, cannot be acquired through direct measurements in normal production of continuous speech. Glottal inverse filtering (GIF), however, can be used to estimate the glottal flow from recorded speech signals. Unfortunately, the usefulness of GIF algorithms is limited since they are sensitive to noise and call for high-quality recordings. Recently, efforts have been taken to expand the use of GIF by training deep neural networks (DNNs) to l…

Citations: cited by 2 publications (2 citation statements)
References: 21 publications
“…Based on this difference for GlottHMM, it can be argued that the shortcomings of the spectral model (from which GlottHMM suffers in analysis-synthesis) are averaged out in current SPSS acoustic models. Furthermore, when comparing the performance differences between GlottHMM and GlottDNN, we speculate that the quality of the DNN-based excitation of GlottDNN is highly voice-specific (for example, the performance of "Roger" decreased considerably because of the picked-up reverberation), as is also concluded in a separate study [64]. The simple excitation generation based on a high-quality glottal pulse utilized by GlottHMM is thus a safer option for more challenging voices.…”
Section: Discussion (supporting)
confidence: 54%
“…It is known that signal processing-based GIF methods are affected by distortions in the speech signal due to ambient noise, the poor audio quality of the recording equipment, and compression and bandwidth limitation caused by speech transmission [30], [269]. To address this issue, a few recent studies [269]-[271] have proposed using DNN-based methods for estimation of the glottal source waveform. In [269], coded telephone quality speech was studied using a DNN-based GIF method by using both clean and coded speech in training.…”
Section: A. Deep Learning for GIF and for Extraction of F0 and GCI (mentioning)
confidence: 99%