A pitch tracking corpus with evaluation on multipitch tracking scenario

Pirker, Gregor; Wohlmayr, Michael; Petrik, Stefan; Pernkopf, Franz

doi:10.21437/interspeech.2011-317

Cited by 87 publications

(16 citation statements)

References 10 publications


“…For training, we used the TIMIT [9] and PTDB-TUG speech datasets [20]. During the training, different scenarios are simulated where either one or two sources are concurrently active, similar to [2].…”

Section: Training (mentioning)

confidence: 99%

Improved Separation of Closely-spaced Speakers by Exploiting Auxiliary Direction of Arrival Information within a U-Net Architecture

Kindt, Bohlender, Madhu (2022)

2022 18th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS)


Microphone arrays use spatial diversity for separating concurrent audio sources. Source signals from different directions of arrival (DOAs) are captured with DOA-dependent time delays between the microphones. These can be exploited in the short-time Fourier transform domain to yield time-frequency masks that extract a target signal while suppressing unwanted components. Using deep neural networks (DNNs) for mask estimation has drastically improved separation performance. However, separation of closely spaced sources remains difficult due to their similar inter-microphone time delays. We propose using auxiliary information on source DOAs within the DNN to improve the separation. This can be encoded by the expected phase differences between the microphones. Alternatively, the DNN can learn a suitable input representation on its own when provided with a multi-hot encoding of the DOAs. Experimental results demonstrate the benefit of this information for separating closely spaced sources.
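The two DOA encodings named in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the far-field plane-wave model, the single microphone pair, the 343 m/s speed of sound, the 5-degree grid, and the function names `expected_phase_differences` and `multi_hot_doa` are all assumptions made here for illustration.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed value for air at room temperature


def expected_phase_differences(doa_deg, mic_distance, sample_rate, fft_size):
    """Expected inter-microphone phase difference (radians) per STFT bin.

    Assumes a far-field source and a single microphone pair; the DOA is
    measured from the axis through the two microphones (0 degrees = endfire).
    """
    # Time difference of arrival of a plane wave between the two microphones.
    delay = mic_distance * math.cos(math.radians(doa_deg)) / SPEED_OF_SOUND
    phases = []
    for k in range(fft_size // 2 + 1):
        freq = k * sample_rate / fft_size  # centre frequency of bin k in Hz
        phase = 2.0 * math.pi * freq * delay
        # Wrap to [-pi, pi], since phase is only observable modulo 2*pi.
        phases.append(math.atan2(math.sin(phase), math.cos(phase)))
    return phases


def multi_hot_doa(doas_deg, resolution_deg=5):
    """Multi-hot vector over a 360-degree grid with one entry per DOA bin."""
    n_bins = 360 // resolution_deg
    encoding = [0] * n_bins
    for doa in doas_deg:
        # Mark the grid cell nearest to each active source direction.
        encoding[int(round(doa / resolution_deg)) % n_bins] = 1
    return encoding
```

Two sources that are close in angle produce nearly identical phase-difference vectors, which is consistent with the difficulty the abstract describes for closely spaced sources.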


“…PTDB [31] is the dataset most commonly used in recent work on pitch estimation for speech. For this reason, we use PTDB as a representation of performance on speech data.…”

Section: A. Data (mentioning)

confidence: 99%

“…accurately predict the fundamental frequency of speech on the PTDB dataset [31], while estimators with larger receptive fields [29], [30] are able to. As we will show, CREPE has no difficulty learning accurate pitch on PTDB (when the atypical and undocumented alignment between the audio and pitch of PTDB is addressed), and the generalization gap is due to mismatched data distributions between training and evaluation.…”

mentioning

confidence: 99%

Cross-domain Neural Pitch and Periodicity Estimation

Morrison, Hsieh, Pruyne et al. (2023)

Preprint


Pitch is a foundational aspect of our perception of audio signals. Pitch contours are commonly used to analyze speech and music signals and as input features for many audio tasks, including music transcription, singing voice synthesis, and prosody editing. In this paper, we describe a set of techniques for improving the accuracy of state-of-the-art neural pitch and periodicity estimators. We also introduce a novel entropy-based method for extracting periodicity and per-frame voiced/unvoiced classifications from statistical inference-based pitch estimators (e.g., neural networks), and show how to train a neural pitch estimator to simultaneously handle speech and music without performance degradation. While neural pitch trackers have historically been significantly slower than signal processing based pitch trackers, our estimator implementations approach the speed of state-of-the-art DSP-based pitch estimators on a standard CPU, but with significantly more accurate pitch and periodicity estimation. Our experiments show that an accurate, cross-domain pitch and periodicity estimator written in PyTorch with a hopsize of ten milliseconds can run 11.2x faster than real-time on an Intel i9-9820X 10-core 3.30 GHz CPU or 408x faster than real-time on an NVIDIA GeForce RTX 3090 GPU without hardware optimization. We release all of our code and models as Pitch-Estimating Neural Networks (penn), an open-source, pip-installable Python module for training, evaluating, and performing inference with pitch- and periodicity-estimating neural networks. The code for penn is available at github.com/interactiveaudiolab/penn.
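The abstract does not spell out the entropy-based method, so the following is only a sketch of one plausible reading: periodicity is taken as one minus the normalized entropy of a per-frame distribution over pitch bins, and a frame is called voiced when that value exceeds a threshold. The formula, the threshold of 0.5, and the names `entropy_periodicity` and `is_voiced` are assumptions made here; they do not describe the penn API.

```python
import math


def entropy_periodicity(posterior):
    """Periodicity in [0, 1] from a per-frame distribution over pitch bins.

    Assumed definition: 1 - H(p) / ln(N), where H is the Shannon entropy and
    N is the number of pitch bins. A peaked distribution gives a value near 1
    and a flat one gives a value near 0.
    """
    total = sum(posterior)
    probs = [p / total for p in posterior]  # normalise to sum to one
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return 1.0 - entropy / math.log(len(probs))


def is_voiced(posterior, threshold=0.5):
    """Voiced/unvoiced decision; the threshold is a hypothetical default."""
    return entropy_periodicity(posterior) > threshold
```

For example, a frame whose probability mass sits on a single pitch bin would be classified as voiced, while a frame with a uniform distribution would not.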


“…The four methods were evaluated on the PTDB-TUG database [15]. The database contains clean utterances from 20 speakers (10 males and 10 females).…”

Section: A. Experimental Setup (mentioning)

confidence: 99%

Improved CEM for Speech Harmonic Enhancement in Single Channel Noise Suppression

Song, Madhu (2022)

IEEE/ACM Trans. Audio Speech Lang. Process.


The periodic nature of voiced speech is often exploited to restore speech harmonics and to increase inter-harmonic noise suppression. In particular, a recent paper proposed to do this by manipulating the speech harmonic frequencies in the cepstral domain. The manipulations were carried out on the cepstrum of the excitation signal, obtained by the source-filter decomposition of speech. This method was termed Cepstral Excitation Manipulation (CEM). In this contribution we further analyse this method, point out its inherent weakness and propose means to overcome it. First of all, it will be shown by both illustrative examples and theoretical analysis that the existing method underestimates the excitation, especially at low signal-to-noise ratio (SNR) conditions. This inherent weakness leads to speech harmonic weakening and vocoding due to the insufficient noise suppression in the inter-harmonic regions. Then, we propose two modifications to improve the robustness and performance of CEM in low SNR cases. The first modification is to use an instantaneous amplifying factor adapted to the signal, instead of a pre-defined constant, for the excitation cepstrum. The second modification is to smooth the excitation cepstrum to preserve additional fine structure, instead of discarding it. These modifications result in better preservation of speech harmonics, more refined fine structure and higher inter-harmonic noise suppression. Experimental evaluations using a range of standard instrumental metrics conclusively demonstrate that our proposed modifications clearly outperform the existing method, especially in extremely noisy conditions.
