ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9053089

GCI Detection from Raw Speech Using a Fully-Convolutional Network

Abstract: Glottal Closure Instants (GCI) detection consists of automatically locating, from the speech signal, the temporal positions of the most significant excitation of the vocal tract. It is used in many speech analysis and processing applications, and various algorithms have been proposed for this purpose. Recently, new approaches using convolutional neural networks have emerged, with encouraging results. Following this trend, we propose a simple approach that performs a regression from the speech waveform to a target signal […]
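The abstract stops at the idea of regressing from the raw waveform to a target signal that encodes GCI locations. As a rough illustration of that idea only, here is a minimal sketch of a 1-D fully-convolutional regressor; the layer count, channel widths, kernel size, target definition, and the use of PyTorch are all assumptions, not details taken from the paper.

```python
# Minimal sketch of a 1-D fully-convolutional network mapping a raw waveform to a
# same-length target signal (e.g. one whose peaks would mark GCIs). The architecture
# hyperparameters here are illustrative assumptions, not the paper's values.
import torch
import torch.nn as nn

class FCNRegressor(nn.Module):
    def __init__(self, channels=(1, 16, 32, 64, 32, 16, 1), kernel_size=9):
        super().__init__()
        layers = []
        n_convs = len(channels) - 1
        for i, (c_in, c_out) in enumerate(zip(channels[:-1], channels[1:])):
            layers.append(nn.Conv1d(c_in, c_out, kernel_size, padding=kernel_size // 2))
            if i < n_convs - 1:            # no activation after the output layer
                layers.append(nn.ReLU())
        self.net = nn.Sequential(*layers)

    def forward(self, waveform):
        # waveform: (batch, 1, num_samples) -> regressed target: (batch, 1, num_samples)
        return self.net(waveform)

model = FCNRegressor()
x = torch.randn(2, 1, 16000)               # two 1-second waveforms at 16 kHz
y_hat = model(x)                           # regressed target signal
loss = nn.functional.mse_loss(y_hat, torch.zeros_like(y_hat))  # placeholder target
```

Because every layer is convolutional, the same model applies to utterances of arbitrary length, and GCI candidates could then be read off from the regressed target, e.g. by peak picking.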


Cited by 12 publications (16 citation statements)
References 23 publications
“…In [279], the GCI detection was posed as a temporal event detection problem, relaxing the constraints used in [278]. In [279] and [280], the GCI detection was formulated using a representation learning perspective, where an appropriate representation is implicitly learned from the raw signal. In [281] and [282], a deep CNN-based GCI detection method was proposed by fusing raw speech and LP residual features.…”
Section: A. Deep Learning for GIF and for Extraction of F0 and GCI (citation type: mentioning; confidence: 99%)
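The statement above mentions CNN-based GCI detectors that fuse the raw speech signal with its linear prediction (LP) residual [281], [282]. Below is a hedged sketch of how such a residual could be computed and stacked with the waveform; the file name, sampling rate, and LPC order are illustrative assumptions, not details from those papers.

```python
# Sketch of computing an LP residual to pair with the raw waveform as CNN input.
# The input path, sampling rate, and LPC order are illustrative assumptions.
import librosa
import numpy as np
from scipy.signal import lfilter

y, sr = librosa.load("speech.wav", sr=16000)   # hypothetical input file
order = 2 + sr // 1000                         # common rule of thumb for LPC order

# librosa.lpc returns the prediction-error filter coefficients [1, a1, ..., ap];
# filtering the signal with them yields the LP residual (an excitation estimate).
a = librosa.lpc(y, order=order)
residual = lfilter(a, [1.0], y)

# Stack raw speech and residual as two input channels for a CNN.
features = np.stack([y, residual], axis=0)     # shape: (2, num_samples)
```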
“…Recently, it has been reported that the representations from different layers of wav2vec 2.0 exhibit different characteristics. In particular, Shah et al. [39] showed that it is the output from the middle layer that has the most relevant characteristics for pronunciation. In light of this empirical observation, we decided to use the intermediate features of XLSR-53.…”
Section: Analysis Features (citation type: mentioning; confidence: 99%)
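The quoted statement only says that intermediate XLSR-53 (wav2vec 2.0) features were used, not how they were extracted. A minimal sketch with the Hugging Face transformers API is given below; the public facebook/wav2vec2-large-xlsr-53 checkpoint and the chosen middle layer (index 12 of 24) are assumptions made for illustration.

```python
# Sketch of pulling an intermediate hidden state from XLSR-53 via Hugging Face
# transformers; the checkpoint name and layer index are illustrative assumptions,
# chosen only to show "use a middle layer rather than the last one".
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

ckpt = "facebook/wav2vec2-large-xlsr-53"
extractor = Wav2Vec2FeatureExtractor.from_pretrained(ckpt)
model = Wav2Vec2Model.from_pretrained(ckpt)
model.eval()

waveform = torch.randn(16000)  # stand-in for 1 second of 16 kHz speech
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    out = model(inputs.input_values, output_hidden_states=True)

# hidden_states[0] is the convolutional feature projection; [1:] are the 24
# transformer layers. Pick a middle layer as the analysis feature.
middle_features = out.hidden_states[12]   # (batch, frames, 1024)
```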
“…Pitch: Due to the irregular periodicity of the glottal pulse, we often hear creaky voice in speech, which usually manifests as jitter or sub-harmonics in the signal. This makes it hard for f0 trackers to estimate f0, because f0 itself is not well defined in such cases [16, 1, 2]. We take a hint from the popular YIN algorithm to address this issue.…”
Section: Analysis Features (citation type: mentioning; confidence: 99%)
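The quoted passage only says the authors "take a hint from" the YIN algorithm; it does not give their modification. For reference, here is a minimal NumPy sketch of YIN's core steps (difference function and cumulative mean normalized difference). The frame length, search range, and threshold are illustrative assumptions, and the full YIN algorithm additionally refines the lag with local-minimum search and parabolic interpolation.

```python
# Minimal sketch of the core of YIN: difference function and cumulative mean
# normalized difference, followed by a simple threshold-based lag pick.
import numpy as np

def yin_f0(frame, sr, fmin=60.0, fmax=500.0, threshold=0.1):
    tau_min = int(sr / fmax)
    tau_max = int(sr / fmin)

    # Difference function d(tau) = sum_n (x[n] - x[n + tau])^2
    d = np.array([
        np.sum((frame[:-tau] - frame[tau:]) ** 2) if tau > 0 else 0.0
        for tau in range(tau_max + 1)
    ])

    # Cumulative mean normalized difference d'(tau) = d(tau) * tau / sum_{j<=tau} d(j)
    cmnd = np.ones_like(d)
    cumsum = np.cumsum(d[1:])
    cmnd[1:] = d[1:] * np.arange(1, tau_max + 1) / np.maximum(cumsum, 1e-12)

    # First lag below the threshold; fall back to the global minimum in range.
    below = np.where(cmnd[tau_min:tau_max] < threshold)[0]
    tau = tau_min + (below[0] if below.size else np.argmin(cmnd[tau_min:tau_max]))
    return sr / tau

frame = np.sin(2 * np.pi * 220 * np.arange(2048) / 16000)  # 220 Hz test tone
print(yin_f0(frame, sr=16000))  # roughly 220 Hz
```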