ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9053089

GCI Detection from Raw Speech Using a Fully-Convolutional Network

Abstract: Glottal Closure Instants (GCI) detection consists of automatically locating, from the speech signal, the temporal positions of the most significant excitation of the vocal tract. It is used in many speech analysis and processing applications, and various algorithms have been proposed for this purpose. Recently, new approaches using convolutional neural networks have emerged, with encouraging results. Following this trend, we propose a simple approach that performs a regression from the speech waveform to a target signal […]
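The abstract stops at the idea of regressing from the raw waveform to a target signal that encodes GCI locations. As a rough illustration of that idea only, here is a minimal sketch of a 1-D fully-convolutional regressor; the layer count, channel widths, kernel size, target definition, and the use of PyTorch are all assumptions, not details taken from the paper.

```python
# Minimal sketch of a 1-D fully-convolutional network mapping a raw waveform to a
# same-length target signal (e.g. one whose peaks would mark GCIs). The architecture
# hyperparameters here are illustrative assumptions, not the paper's values.
import torch
import torch.nn as nn

class FCNRegressor(nn.Module):
    def __init__(self, channels=(1, 16, 32, 64, 32, 16, 1), kernel_size=9):
        super().__init__()
        layers = []
        n_convs = len(channels) - 1
        for i, (c_in, c_out) in enumerate(zip(channels[:-1], channels[1:])):
            layers.append(nn.Conv1d(c_in, c_out, kernel_size, padding=kernel_size // 2))
            if i < n_convs - 1:            # no activation after the output layer
                layers.append(nn.ReLU())
        self.net = nn.Sequential(*layers)

    def forward(self, waveform):
        # waveform: (batch, 1, num_samples) -> regressed target: (batch, 1, num_samples)
        return self.net(waveform)

model = FCNRegressor()
x = torch.randn(2, 1, 16000)               # two 1-second waveforms at 16 kHz
y_hat = model(x)                           # regressed target signal
loss = nn.functional.mse_loss(y_hat, torch.zeros_like(y_hat))  # placeholder target
```

Because every layer is convolutional, the same model applies to utterances of arbitrary length, and GCI candidates could then be read off from the regressed target, e.g. by peak picking.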


Cited by 12 publications (16 citation statements)
References 23 publications
“…In [279], the GCI detection was posed as a temporal event detection problem, relaxing the constraints used in [278]. In [279] and [280], the GCI detection was formulated using a representation learning perspective, where an appropriate representation is implicitly learned from the raw signal. In [281] and [282], a deep CNN-based GCI detection method was proposed by fusing raw speech and LP residual features.…”
Section: A. Deep Learning for GIF and for Extraction of F0 and GCI (citation type: mentioning; confidence: 99%)
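The statement above mentions CNN-based GCI detectors that fuse the raw speech signal with its linear prediction (LP) residual [281], [282]. Below is a hedged sketch of how such a residual could be computed and stacked with the waveform; the file name, sampling rate, and LPC order are illustrative assumptions, not details from those papers.

```python
# Sketch of computing an LP residual to pair with the raw waveform as CNN input.
# The input path, sampling rate, and LPC order are illustrative assumptions.
import librosa
import numpy as np
from scipy.signal import lfilter

y, sr = librosa.load("speech.wav", sr=16000)   # hypothetical input file
order = 2 + sr // 1000                         # common rule of thumb for LPC order

# librosa.lpc returns the prediction-error filter coefficients [1, a1, ..., ap];
# filtering the signal with them yields the LP residual (an excitation estimate).
a = librosa.lpc(y, order=order)
residual = lfilter(a, [1.0], y)

# Stack raw speech and residual as two input channels for a CNN.
features = np.stack([y, residual], axis=0)     # shape: (2, num_samples)
```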
“…Recently, it has been reported that the representations from different layers of wav2vec 2.0 exhibit different characteristics. In particular, Shah et al. [39] showed that it is the output from the middle layer that has the most relevant characteristics for pronunciation. In light of this empirical observation, we decided to use the intermediate features of XLSR-53.…”
Section: Analysis Features (citation type: mentioning; confidence: 99%)
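The quoted statement only says that intermediate XLSR-53 (wav2vec 2.0) features were used, not how they were extracted. A minimal sketch with the Hugging Face transformers API is given below; the public facebook/wav2vec2-large-xlsr-53 checkpoint and the chosen middle layer (index 12 of 24) are assumptions made for illustration.

```python
# Sketch of pulling an intermediate hidden state from XLSR-53 via Hugging Face
# transformers; the checkpoint name and layer index are illustrative assumptions,
# chosen only to show "use a middle layer rather than the last one".
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

ckpt = "facebook/wav2vec2-large-xlsr-53"
extractor = Wav2Vec2FeatureExtractor.from_pretrained(ckpt)
model = Wav2Vec2Model.from_pretrained(ckpt)
model.eval()

waveform = torch.randn(16000)  # stand-in for 1 second of 16 kHz speech
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    out = model(inputs.input_values, output_hidden_states=True)

# hidden_states[0] is the convolutional feature projection; [1:] are the 24
# transformer layers. Pick a middle layer as the analysis feature.
middle_features = out.hidden_states[12]   # (batch, frames, 1024)
```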
“…Pitch: Due to the irregular periodicity of the glottal pulse, we often hear creaky voice in speech, which usually manifests as jitter or sub-harmonics in the signal. This makes it hard for f0 trackers to estimate f0, because f0 itself is not well defined in such cases [16, 1, 2]. We take a hint from the popular YIN algorithm to address this issue.…”
Section: Analysis Features (citation type: mentioning; confidence: 99%)
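The quoted passage only says the authors "take a hint from" the YIN algorithm; it does not give their modification. For reference, here is a minimal NumPy sketch of YIN's core steps (difference function and cumulative mean normalized difference). The frame length, search range, and threshold are illustrative assumptions, and the full YIN algorithm additionally refines the lag with local-minimum search and parabolic interpolation.

```python
# Minimal sketch of the core of YIN: difference function and cumulative mean
# normalized difference, followed by a simple threshold-based lag pick.
import numpy as np

def yin_f0(frame, sr, fmin=60.0, fmax=500.0, threshold=0.1):
    tau_min = int(sr / fmax)
    tau_max = int(sr / fmin)

    # Difference function d(tau) = sum_n (x[n] - x[n + tau])^2
    d = np.array([
        np.sum((frame[:-tau] - frame[tau:]) ** 2) if tau > 0 else 0.0
        for tau in range(tau_max + 1)
    ])

    # Cumulative mean normalized difference d'(tau) = d(tau) * tau / sum_{j<=tau} d(j)
    cmnd = np.ones_like(d)
    cumsum = np.cumsum(d[1:])
    cmnd[1:] = d[1:] * np.arange(1, tau_max + 1) / np.maximum(cumsum, 1e-12)

    # First lag below the threshold; fall back to the global minimum in range.
    below = np.where(cmnd[tau_min:tau_max] < threshold)[0]
    tau = tau_min + (below[0] if below.size else np.argmin(cmnd[tau_min:tau_max]))
    return sr / tau

frame = np.sin(2 * np.pi * 220 * np.arange(2048) / 16000)  # 220 Hz test tone
print(yin_f0(frame, sr=16000))  # roughly 220 Hz
```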