Statistical Regression Models for Noise Robust F0 Estimation Using Recurrent Deep Neural Networks

Kato, Akihiro; Kinnunen, Tomi

doi:10.1109/taslp.2019.2945489

Cited by 7 publications

(4 citation statements)

References 35 publications


“…The experimental results are summarized in Table 1. Although the performance of UDS-based pitch estimation is not as high as the speech-based pitch estimation (typically, GPE rate < 20% for clean speech signal [42]), the superiority of the UDS signal in terms of pitch estimation is clearly found for all metrics. Such results suggest that UDS provides more useful information for pitch estimation.…”

Section: Performance Of Pitch Estimation and V/uv Decisions

mentioning

confidence: 94%

“…Although the samples for each modality were not recorded simultaneously (because of changes in the shapes of the mouth region by attaching the EMG electrodes), the differences in speech signals among the modalities were minimized by using the common utterance set and asking the subjects to pronounce each word in a consistent manner. The performance of pitch estimation was evaluated using the two standard metrics: the gross pitch error (GPE) rate and the fine pitch error (FPE) [42]. The GPE frames are defined as voiced frames where the error between the estimated pitch period and the ground truth is greater than 0.625 ms.…”

Section: Performance Of Pitch Estimation and V/uv Decisions

mentioning

confidence: 99%
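The GPE definition quoted above can be illustrated with a minimal sketch. It assumes per-frame pitch periods in milliseconds and a ground-truth voicing mask, and it assumes the rate is normalized by the number of voiced frames; the quoted statement gives only the 0.625 ms threshold, so the function name, argument names, and normalization are hypothetical choices for illustration.

```python
import math

# Threshold taken from the quoted citation statement (pitch-period error in ms).
GPE_THRESHOLD_MS = 0.625


def gross_pitch_error_rate(est_period_ms, ref_period_ms, voiced):
    """Fraction of voiced frames whose pitch-period error exceeds the threshold.

    Assumption: `voiced` is the ground-truth voicing mask and the rate is
    normalized by the number of voiced frames.
    """
    if not (len(est_period_ms) == len(ref_period_ms) == len(voiced)):
        raise ValueError("all inputs must have one entry per frame")
    # Absolute period error, kept only for voiced frames.
    errors = [
        abs(est - ref)
        for est, ref, is_voiced in zip(est_period_ms, ref_period_ms, voiced)
        if is_voiced
    ]
    if not errors:
        return math.nan  # undefined when there are no voiced frames
    return sum(err > GPE_THRESHOLD_MS for err in errors) / len(errors)


# Toy data: three voiced frames, one of which is off by 1.0 ms.
rate = gross_pitch_error_rate(
    [5.0, 6.0, 8.0, 4.0],
    [5.1, 7.0, 8.2, 9.0],
    [True, True, True, False],
)
print(f"GPE rate: {rate:.1%}")  # prints "GPE rate: 33.3%"
```

The unvoiced fourth frame is ignored even though its error is large, which matches the restriction of GPE frames to voiced frames in the quoted definition.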


Ultrasonic Doppler Based Silent Speech Interface Using Perceptual Distance

Lee

2022

Applied Sciences


Moderate performance in terms of intelligibility and naturalness can be obtained using previously established silent speech interface (SSI) methods. Nevertheless, a common problem associated with SSI has involved deficiencies in estimating the spectrum details, which results in synthesized speech signals that are rough, harsh, and unclear. In this study, harmonic enhancement (HE) was used during postprocessing to alleviate this problem by emphasizing the spectral fine structure of speech signals. To improve the subjective quality of synthesized speech, the difference between synthesized and actual speech was established by calculating the distance in the perceptual domains instead of using the conventional mean square error (MSE). Two deep neural networks (DNNs) were employed to separately estimate the speech spectra and the filter coefficients of HE, connected in a cascading manner. The DNNs were trained to incrementally and iteratively minimize both the MSE and the perceptual distance (PD). A feasibility test showed that the perceptual evaluation of speech quality (PESQ) and the short-time objective intelligibility measure (STOI) were improved by 17.8 and 2.9%, respectively, compared with previous methods. Subjective listening tests revealed that the proposed method yielded perceptually preferred results compared with that of the conventional MSE-based method.



“…Periodicity estimation with statistical pitch estimators has been treated inconsistently in recent literature on neural pitch estimation. Some studies omit the evaluation of periodicity or voicing [25], [27]. Others demonstrate binary voicing classification that, at best, slightly outperforms DSP-based baselines [33], [43].…”

Section: Estimators

mentioning

confidence: 99%

“…Notable exceptions to the candidate-generation/candidate-selection paradigm that do not produce a sequence of scores for subsequent decoding include the self-supervised SPICE [33] and the sinusoidal regression method by Kato et al. [27]. While these methods are interesting, they are significantly more complicated than state-of-the-art neural methods trained in supervised classification paradigm, without substantial gains in performance or speed.…”

mentioning

confidence: 99%

Cross-domain Neural Pitch and Periodicity Estimation

Morrison, Hsieh, Pruyne et al.

2023

Preprint


Pitch is a foundational aspect of our perception of audio signals. Pitch contours are commonly used to analyze speech and music signals and as input features for many audio tasks, including music transcription, singing voice synthesis, and prosody editing. In this paper, we describe a set of techniques for improving the accuracy of state-of-the-art neural pitch and periodicity estimators. We also introduce a novel entropy-based method for extracting periodicity and per-frame voiced-unvoiced classifications from statistical inference-based pitch estimators (e.g., neural networks), and show how to train a neural pitch estimator to simultaneously handle speech and music without performance degradation. While neural pitch trackers have historically been significantly slower than signal processing based pitch trackers, our estimator implementations approach the speed of state-of-the-art DSP-based pitch estimators on a standard CPU, but with significantly more accurate pitch and periodicity estimation. Our experiments show that an accurate, cross-domain pitch and periodicity estimator written in PyTorch with a hopsize of ten milliseconds can run 11.2x faster than real-time on an Intel i9-9820X 10-core 3.30 GHz CPU or 408x faster than real-time on an NVIDIA GeForce RTX 3090 GPU without hardware optimization. We release all of our code and models as Pitch-Estimating Neural Networks (penn), an open-source, pip-installable Python module for training, evaluating, and performing inference with pitch- and periodicity-estimating neural networks. The code for penn is available at github.com/interactiveaudiolab/penn.
