Joint Singing Pitch Estimation and Voice Separation Based on a Neural Harmonic Structure Renderer

Nakano, Takanori; Yoshii, Kazuyoshi; Wu, Yiming; Nishikimi, Ryo; Lin, Kin Wah Edward; Goto, Masataka

doi:10.1109/waspaa.2019.8937135

Cited by 7 publications

(8 citation statements)

References 9 publications

“…A voice separation method and a beat-tracking method are used in the preprocessing step in the present method, and we observed that errors made in the preprocessing step can propagate to the transcription results. To mitigate the problem, multi-task learning of the singing voice separation and the AST can also be effective in obtaining the singing voices appropriate for the AST [5]. A beat-tracking method typically estimates beat times in the accompaniment sounds, which can be slightly shifted from the onset times of the singing voice due to the asynchrony between the vocal and the other parts [36].…”

Section: E) Discussion (mentioning)

confidence: 99%

“…Inspired by the CNN proposed for frame-level melody F0 estimation [3], the frame-level CNN of the acoustic model (Fig. 5) was designed to have six convolution layers with the output channels of 128, 64, 64, 64, 8, and 1 and the kernel sizes of (5, 5), (5, 5), (3, 3), (3, 3), (70, 3), and (1, 1), respectively, where the instance normalization [31] and the Mish function [32] are used. The output dimension of the tatum-level BLSTM was set to D = 130 × 2.…”

Section: B) Setup (mentioning)

confidence: 99%
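
The layer listing in the statement above can be checked with a few lines of arithmetic. The following is a minimal sketch, not the cited authors' code: it counts convolution weights and biases for the six quoted layers and defines the Mish activation. The input channel count of 1 is an assumption (the quote does not state it), and the instance-normalization parameters are ignored.

```python
import math

# (out_channels, kernel_height, kernel_width) as listed in the quoted setup.
LAYERS = [(128, 5, 5), (64, 5, 5), (64, 3, 3), (64, 3, 3), (8, 70, 3), (1, 1, 1)]


def conv_param_counts(in_channels=1, layers=LAYERS):
    """Weights plus biases per layer; in_channels=1 is an assumption."""
    counts = []
    for out_channels, kh, kw in layers:
        counts.append(in_channels * out_channels * kh * kw + out_channels)
        in_channels = out_channels  # the next layer consumes this layer's output
    return counts


def mish(x):
    """Mish activation: x * tanh(softplus(x))."""
    return x * math.tanh(math.log1p(math.exp(x)))
```

Under these assumptions the (5, 5) layer with 128 input and 64 output channels holds the most parameters, followed by the (70, 3) layer.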

“…To estimate the semitone-level pitches and tatum-level onset and offset times of musical notes from music signals, one may estimate a singing F0 trajectory [3][4][5][6] and then quantize it on the semitone and tatum grids obtained by a beat-tracking method [7], where the tatum (e.g. 16th-note level) refers to the smallest meaningful subdivision of the main beat (e.g.…”

Section: Introduction (mentioning)

confidence: 99%
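
The cascading approach described in the statement above (estimate an F0 trajectory, then snap it to semitone and tatum grids) can be sketched as follows. This is an illustrative baseline only, not the method of any cited paper. The function names, the use of 0.0 Hz to mark unvoiced frames, and the per-tatum median are all assumptions made for the example.

```python
import math
from statistics import median


def hz_to_midi(f0_hz):
    """Convert a frequency in Hz to a (fractional) MIDI note number."""
    return 69.0 + 12.0 * math.log2(f0_hz / 440.0)


def quantize_f0(f0_hz, frame_times, tatum_times):
    """Return one semitone-level pitch (or None) per tatum interval.

    f0_hz: per-frame F0 in Hz, 0.0 for unvoiced frames (assumed convention)
    frame_times: per-frame time stamps in seconds
    tatum_times: sorted tatum boundaries in seconds
    """
    notes = []
    for start, end in zip(tatum_times[:-1], tatum_times[1:]):
        # Collect the voiced frames that fall inside this tatum interval.
        voiced = [hz_to_midi(f) for f, t in zip(f0_hz, frame_times)
                  if start <= t < end and f > 0.0]
        # Semitone quantization: round the median pitch to the nearest MIDI note.
        notes.append(round(median(voiced)) if voiced else None)
    return notes
```

Because each tatum interval is decided independently from the frames it contains, any error in the F0 track or in the tatum grid passes straight into the output notes.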

Audio-to-score singing transcription based on a CRNN-HSMM hybrid model

Nishikimi, Nakamura, Goto, et al. 2021. SIP. (Self-citation)

This paper describes an automatic singing transcription (AST) method that estimates a human-readable musical score of a sung melody from an input music signal. Because of the considerable pitch and temporal variation of a singing voice, a naive cascading approach that estimates an F0 contour and quantizes it with estimated tatum times cannot avoid many pitch and rhythm errors. To solve this problem, we formulate a unified generative model of a music signal that consists of a semi-Markov language model representing the generative process of latent musical notes conditioned on musical keys and an acoustic model based on a convolutional recurrent neural network (CRNN) representing the generative process of an observed music signal from the notes. The resulting CRNN-HSMM hybrid model enables us to estimate the most-likely musical notes from a music signal with the Viterbi algorithm, while leveraging both the grammatical knowledge about musical notes and the expressive power of the CRNN. The experimental results showed that the proposed method outperformed the conventional state-of-the-art method and the integration of the musical language model with the acoustic model has a positive effect on the AST performance.

“…Most recently, Nakano et al [22] and Jansson et al [23] almost at the same time proposed to train the SVS task and the VME task jointly. Both methods obtained promising results.…”

Section: Source Separation-based Vocal Melody Extraction (mentioning)

confidence: 99%

“…According to the performance of Deep Salience reported in [22], the F0 values estimated by Deep Salience still contain errors, which limits the performance of this method to a certain extent. In [23], the authors designed a differentiable layer that converts an F0 saliency spectrogram into harmonic masks indicating the locations of harmonic partials of a singing voice. However, this system is not robust to the backing vocals, since in the SVS task the backing vocals belong to vocals but in the VME task, the pitches of backing vocals do not belong to the vocal melody.…”

Section: Source Separation-based Vocal Melody Extraction (mentioning)

confidence: 99%
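
The idea of turning F0 saliency into a harmonic mask, mentioned in the statement above, can be illustrated for a single time frame. This is a rough sketch under stated assumptions, not the layer used in the cited work: the Gaussian bump shape, the `width_hz` and `n_harmonics` parameters, and the final clipping are all hypothetical choices made for the example.

```python
import math


def harmonic_mask(saliency, f0_candidates_hz, bin_freqs_hz,
                  n_harmonics=5, width_hz=20.0):
    """Build a soft mask over frequency bins for one time frame.

    saliency: weight in [0, 1] for each candidate F0
    f0_candidates_hz: candidate F0 values in Hz, same length as saliency
    bin_freqs_hz: center frequency of each spectrogram bin in Hz
    """
    mask = [0.0] * len(bin_freqs_hz)
    for weight, f0 in zip(saliency, f0_candidates_hz):
        for h in range(1, n_harmonics + 1):
            center = h * f0  # frequency of the h-th harmonic partial
            for k, freq in enumerate(bin_freqs_hz):
                # Smooth bump around the partial, scaled by the saliency weight.
                mask[k] += weight * math.exp(-0.5 * ((freq - center) / width_hz) ** 2)
    # Clip to [0, 1] for use as a mask (a convenience, not part of the smooth sum).
    return [min(1.0, m) for m in mask]
```

Before clipping, the mask is a weighted sum of smooth functions of the saliency values, which is what allows gradients to flow from a separation loss back to the pitch estimate in a jointly trained system.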

Vocal Melody Extraction via HRNet-Based Singing Voice Separation and Encoder-Decoder-Based F0 Estimation

Gao, Zhang. 2021. Electronics.

Vocal melody extraction is an important and challenging task in music information retrieval. One main difficulty is that, most of the time, various instruments and singing voices are mixed according to harmonic structure, making it hard to identify the fundamental frequency (F0) of a singing voice. Therefore, reducing the interference of accompaniment is beneficial to pitch estimation of the singing voice. In this paper, we first adopted a high-resolution network (HRNet) to separate vocals from polyphonic music, then designed an encoder-decoder network to estimate the vocal F0 values. Experimental results demonstrate the effectiveness of the HRNet-based singing voice separation method in reducing the interference of accompaniment on the extraction of vocal melody, and the proposed vocal melody extraction (VME) system outperforms other state-of-the-art algorithms in most cases.
