SPICE: Self-Supervised Pitch Estimation

Gfeller, Beat; Frank, Christian; Roblek, Dominik; Sharifi, Matt; Tagliasacchi, Marco; Velimirović, Mihajlo

doi:10.1109/taslp.2020.2982285

Cited by 52 publications

(50 citation statements)

References 27 publications

“…Before describing the proposed method in the next section, we here explain the backgrounds by reviewing previous studies. Input signal representations have been studied for music information processing, including the short-time Fourier transform (STFT) [17, 18], the constant-Q transform (CQT) [6], and the log Mel-scale filter-bank [19]. Recently, the harmonic CQT (HCQT) representation, which is obtained by stacking pitch-shifted (upshifted and downshifted) CQT spectrograms, has been proposed [3].…”

Section: Backgrounds (mentioning)

confidence: 99%

“…To estimate the semitone-level pitches and tatum-level onset and offset times of musical notes from music signals, one may estimate a singing F0 trajectory [3][4][5][6] and then quantize it on the semitone and tatum grids obtained by a beat-tracking method [7], where the tatum (e.g. 16th-note level) refers to the smallest meaningful subdivision of the main beat (e.g.…”

Section: Introduction (mentioning)

confidence: 99%

Audio-to-score singing transcription based on a CRNN-HSMM hybrid model

Nishikimi, Nakamura, Goto, et al. 2021. SIP

This paper describes an automatic singing transcription (AST) method that estimates a human-readable musical score of a sung melody from an input music signal. Because of the considerable pitch and temporal variation of a singing voice, a naive cascading approach that estimates an F0 contour and quantizes it with estimated tatum times cannot avoid many pitch and rhythm errors. To solve this problem, we formulate a unified generative model of a music signal that consists of a semi-Markov language model representing the generative process of latent musical notes conditioned on musical keys and an acoustic model based on a convolutional recurrent neural network (CRNN) representing the generative process of an observed music signal from the notes. The resulting CRNN-HSMM hybrid model enables us to estimate the most-likely musical notes from a music signal with the Viterbi algorithm, while leveraging both the grammatical knowledge about musical notes and the expressive power of the CRNN. The experimental results showed that the proposed method outperformed the conventional state-of-the-art method and the integration of the musical language model with the acoustic model has a positive effect on the AST performance.
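The naive cascading approach that the abstract above argues against can be sketched in a few lines, which helps show where its pitch and rhythm errors come from. This is a minimal illustration, not the method of any cited paper: the function names `hz_to_midi` and `quantize_f0` are hypothetical, the tatum times are assumed to be supplied by some beat tracker, and taking the median pitch per tatum interval is one simple choice among many.

```python
import math
from statistics import median


def hz_to_midi(f0_hz):
    """Convert a frequency in Hz to a fractional MIDI note number (A4 = 440 Hz = 69)."""
    return 69.0 + 12.0 * math.log2(f0_hz / 440.0)


def quantize_f0(times, f0_hz, tatum_times):
    """Return one semitone-level pitch (or None) per tatum interval.

    times, f0_hz: frame times in seconds and F0 values in Hz
                  (values <= 0 are treated as unvoiced; an assumption of this sketch).
    tatum_times:  increasing tatum boundaries in seconds (assumed given).
    """
    notes = []
    for start, end in zip(tatum_times[:-1], tatum_times[1:]):
        # Collect voiced frames that fall inside the half-open interval [start, end).
        pitches = [
            hz_to_midi(f)
            for t, f in zip(times, f0_hz)
            if start <= t < end and f > 0
        ]
        # Snap the median pitch to the nearest semitone; None marks a rest.
        notes.append(round(median(pitches)) if pitches else None)
    return notes


if __name__ == "__main__":
    times = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
    f0_hz = [440.0, 445.0, 0.0, 523.25, 520.0, 0.0]
    print(quantize_f0(times, f0_hz, [0.0, 0.25, 0.5, 0.75]))
```

Because each tatum interval is quantized independently, a vibrato or a slightly late onset can flip the rounded pitch or shift a note boundary, which is the kind of error a joint language and acoustic model is meant to reduce.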

“…Robot [136, 174-178]; Computer vision [135, 136, 178-181]; Natural language processing [182, 183]; Reinforcement…”

Section: Automatic Generation Of Label Data (mentioning)

confidence: 99%

Object Detection Recognition and Robot Grasping Based on Machine Learning: A Survey

Bai, Yang, et al. 2020. IEEE Access

With the rapid development of machine learning, its powerful function in the machine vision field is increasingly reflected. The combination of machine vision and robotics to achieve the same precise and fast grasping as that of humans requires high-precision target detection and recognition, location and reasonable grasp strategy generation, which is the ultimate goal of global researchers and one of the prerequisites for the large-scale application of robots. Traditional machine learning has a long history and good achievements in the field of image processing and robot control. The CNN (convolutional neural network) algorithm realizes training of large-scale image datasets, solves the disadvantages of traditional machine learning in large datasets, and greatly improves accuracy, thereby positioning CNNs as a global research hotspot. However, the increasing difficulty of labeled data acquisition limits their development. Therefore, unsupervised learning, self-supervised learning and reinforcement learning, which are less dependent on labeled data, have also undergone rapid development and achieved good performance in the fields of image processing and robot capture. According to the inherent defects of vision, this paper summarizes the research achievements of tactile feedback in the fields of target recognition and robot grasping and finds that the combination of vision and tactile feedback can improve the success rate and robustness of robot grasping. This paper provides a systematic summary and analysis of the research status of machine vision and tactile feedback in the field of robot grasping and establishes a reasonable reference for future research.

“…Fundamental frequency (F0) estimates often serve as mid-level representation [1] in music information retrieval (MIR) tasks such as automatic music transcription [2] and performance analysis [3,4]. There exist a variety of approaches for monophonic F0-estimation, ranging from model-based methods [5][6][7] to more recent deep-learning-based methods [8,9]. A monophonic F0-estimation algorithm typically outputs one F0-value per time instance together with a confidence value that indicates the algorithm's certainty whether the sound source is active or not (sometimes referred to as "voicing").…”

Section: Introduction (mentioning)

confidence: 99%

Reliability Assessment of Singing Voice F0-Estimates Using Multiple Algorithms

Rosenzweig, Scherbaum, Müller. 2021. ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Over the last decades, various conceptually different approaches for fundamental frequency (F0) estimation in monophonic audio recordings have been developed. The algorithms' performances vary depending on the acoustical and musical properties of the input audio signal. A common strategy to assess the reliability (correctness) of an estimated F0-trajectory is to evaluate against an annotated reference. However, such annotations may not be available for a particular audio collection and are typically labor-intensive to generate. In this work, we consider an approach to automatically assess the reliability of F0-trajectories estimated from monophonic singing voice recordings. As main contribution, we propose three reliability indicators that are based on the outputs of multiple algorithms. Besides providing a mathematical description of the indicators, we analyze the indicators' behavior using a set of annotated vocal F0-trajectories. Furthermore, we show the potential of the proposed indicators for exploring unlabeled audio collections.
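The citation statement for this paper notes that a monophonic F0 estimator typically returns one F0 value per time instance plus a confidence ("voicing") value. A common way to use that pair is to mask frames whose confidence is low. The sketch below illustrates only that idea; the function name `mask_unvoiced` and the default threshold of 0.5 are assumptions made for illustration and are not taken from SPICE or from the paper above.

```python
def mask_unvoiced(f0_hz, confidence, threshold=0.5):
    """Keep F0 values whose confidence reaches the threshold; mark the rest as None.

    f0_hz:      per-frame F0 estimates in Hz.
    confidence: per-frame confidence values, assumed here to lie in [0, 1].
    threshold:  hypothetical cut-off; a suitable value depends on the estimator.
    """
    if len(f0_hz) != len(confidence):
        raise ValueError("f0_hz and confidence must have the same length")
    return [f if c >= threshold else None for f, c in zip(f0_hz, confidence)]


if __name__ == "__main__":
    print(mask_unvoiced([220.0, 221.0, 110.0], [0.9, 0.4, 0.5]))
```

A single threshold says nothing about whether a confident estimate is actually correct, which is the gap that reliability indicators built from several algorithms aim to address.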
