2019 27th European Signal Processing Conference (EUSIPCO)
DOI: 10.23919/eusipco.2019.8902550
Joint Singing Voice Separation and F0 Estimation with Deep U-Net Architectures

Citation: Jansson, A., Bittner, R. M., Ewert, S. and Weyde, T. (ORCID: 0000-0001-8028-9905) (2019). Joint singing voice separation and F0 estimation with deep U-net architectures.

Abstract: Vocal source separation and fundamental frequency estimation in music are tightly related tasks. The outputs of vocal source separation systems have previously been used as inputs to vocal fundamental frequency estimation systems; conversely, vocal fundamental frequency has been used as side information to improve vocal source…

Cited by 34 publications (44 citation statements); references 13 publications.
“…We use this prediction along with the vocoder features to synthesise the audio signal. We tried both the discrete representation of the fundamental frequency as described in [23] and a continuous representation, normalised to the range 0 to 1 as used in [13] and found that while the discrete representation leads to slightly higher accuracy in the output, the continuous representation produces a pitch contour perceptually more suitable for synthesis of the signal. Fig.…”
Section: Methods
confidence: 99%
“…Most recently, Nakano et al [22] and Jansson et al [23] almost at the same time proposed to train the SVS task and the VME task jointly. Both methods obtained promising results.…”
Section: Source Separation-based Vocal Melody Extraction
confidence: 99%
“…Both methods obtained promising results. In [22], a joint U-Net model stacking SVS and VME was proposed. However, limited by the size of datasets containing both pure vocal tracks and their corresponding F0 annotations, the authors used a large internal dataset where reference F0 values were annotated by the VME method Deep Salience [5].…”
Section: Source Separation-based Vocal Melody Extraction
confidence: 99%
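The stacking described above (a separation network whose vocal estimate feeds a melody-extraction network, with both trained jointly) can be sketched as a single forward pass. This is a hypothetical control-flow sketch, not the architecture from [22] or [23]: `joint_forward` and the two callables stand in for trained U-Nets, and the element-wise masking step is an assumed interface between them.

```python
def joint_forward(mixture_mag, separator, f0_estimator):
    """Stacked forward pass: source separation first, then F0 estimation.

    separator maps a magnitude spectrogram (list of rows) to a soft mask
    in [0, 1]; f0_estimator maps the masked vocal estimate to a per-frame
    pitch salience. In joint training, the losses on both outputs would be
    summed so that gradients from the F0 task also refine the separator.
    """
    vocal_mask = separator(mixture_mag)
    vocal_mag = [[m * x for m, x in zip(mrow, xrow)]
                 for mrow, xrow in zip(vocal_mask, mixture_mag)]
    f0_salience = f0_estimator(vocal_mag)
    return vocal_mag, f0_salience

# Toy usage with dummy callables in place of the networks.
mix = [[1.0, 2.0], [3.0, 4.0]]
sep = lambda x: [[0.5 for _ in row] for row in x]   # constant soft mask
f0 = lambda x: [max(row) for row in x]              # per-frame "salience"
vocal, salience = joint_forward(mix, sep, f0)
```

The data-dependence this sketch makes explicit is also the weakness the paragraph notes: supervising the F0 head requires paired vocal stems and F0 annotations, which is why [22] resorted to reference F0 values produced by Deep Salience [5].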
“…Therefore, it is essential to accurately estimate the Wiener gain in each time‐frequency slot using a method such as time‐frequency masking. Approaches that directly estimate the Wiener gain have also been developed, which use deep learning techniques to model a mapping function from the mixed sound signals to time‐frequency masks, with deep networks pretrained on target sound‐source signals. On the other hand, deep clustering has been proposed as an approach for estimating not the time‐frequency mask but time‐frequency embedding vectors, so that the embedding vectors for time‐frequency slot pairs dominated by the same sound‐source signal are close together, while those for other signals are further away.…”
Section: Recent Research Trends In Environmental Sound Processing
confidence: 99%
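The per-slot Wiener gain the quoted passage refers to can be written down directly. A minimal sketch, assuming the standard power-ratio form of the gain applied to power spectrograms held as nested lists; `eps` and the function names are illustrative, not from the cited work.

```python
def wiener_gain(target_power, noise_power, eps=1e-12):
    """Wiener gain for one time-frequency slot: |S|^2 / (|S|^2 + |N|^2).

    eps guards against division by zero in silent slots.
    """
    return target_power / (target_power + noise_power + eps)

def apply_mask(mixture_mag, gains):
    """Apply per-slot gains to a mixture magnitude spectrogram (row lists)."""
    return [[g * x for g, x in zip(grow, xrow)]
            for grow, xrow in zip(gains, mixture_mag)]
```

Mask-based methods train a network to predict these gains from the mixture alone; deep clustering, as the passage explains, instead learns per-slot embeddings and recovers the mask afterwards by clustering them.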