ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2019.8683323
A Vocoder Based Method for Singing Voice Extraction

Abstract: This paper presents a novel method for extracting the vocal track from a musical mixture. The musical mixture consists of a singing voice and a backing track, which may comprise various instruments. We use a convolutional network with skip and residual connections as well as dilated convolutions to estimate vocoder parameters, given the spectrogram of an input mixture. The estimated parameters are then used to synthesize the vocal track, without any interference from the backing track. We evaluate our system…
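The abstract describes a convolutional network with skip and residual connections and dilated convolutions. As a rough illustration of how such a block operates, here is a minimal NumPy sketch of one dilated-convolution residual block; the shapes, kernel size, activation, and weight layout are all assumptions for illustration, not the paper's actual architecture.

```python
import numpy as np

def dilated_conv1d(x, w, dilation):
    """Causal 1-D dilated convolution (minimal illustrative version).

    x: (in_channels, time); w: (out_channels, in_channels, kernel).
    Left-pads the input so the output has the same length as the input.
    """
    out_ch, in_ch, k = w.shape
    pad = dilation * (k - 1)
    xp = np.pad(x, ((0, 0), (pad, 0)))
    y = np.zeros((out_ch, x.shape[1]))
    for t in range(x.shape[1]):
        # Gather the k taps spaced `dilation` frames apart, ending at time t.
        taps = xp[:, t + pad - dilation * (k - 1) : t + pad + 1 : dilation]
        y[:, t] = np.einsum('oik,ik->o', w, taps)
    return y

def residual_block(x, w_dil, w_res, w_skip, dilation):
    """One block: dilated conv -> tanh -> 1x1 projections for the
    residual path (added back to the input) and the skip path
    (summed across blocks in a full network)."""
    h = np.tanh(dilated_conv1d(x, w_dil, dilation))
    res = np.einsum('oi,it->ot', w_res, h) + x   # residual connection
    skip = np.einsum('oi,it->ot', w_skip, h)     # skip-path contribution
    return res, skip
```

Stacking such blocks with exponentially increasing dilations grows the receptive field over the input spectrogram quickly, which is the usual motivation for dilated convolutions in this setting.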


Cited by 8 publications (11 citation statements). References 13 publications.
“…We also observe that all three vocoder-based models outperform the mask-based model, UNET, on isolating the vocal signal from the mixture. This follows from our previous observation [13] and is due to our approach of re-synthesising the vocal signal from the mixture rather than applying a mask over the input signal. The models still lag behind on audio quality.…”
Section: Subjective Evaluation
confidence: 78%
“…We use this prediction along with the vocoder features to synthesise the audio signal. We tried both the discrete representation of the fundamental frequency described in [23] and a continuous representation, normalised to the range 0 to 1 as used in [13], and found that while the discrete representation leads to slightly higher accuracy in the output, the continuous representation produces a pitch contour perceptually more suitable for synthesis of the signal. Fig.…”
Section: Methods
confidence: 99%
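The citation statement above contrasts a discrete F0 representation with a continuous one normalised to the range 0 to 1. A minimal sketch of such a continuous mapping is shown below; the log-scale mapping, the frequency bounds `fmin`/`fmax`, and the treatment of unvoiced frames are assumptions for illustration, as the cited papers do not specify them here.

```python
import numpy as np

def normalise_f0(f0_hz, fmin=50.0, fmax=500.0):
    """Map an F0 contour in Hz to a continuous [0, 1] representation.

    Unvoiced frames (f0 == 0) map to 0 (an assumed convention); voiced
    frames are scaled log-linearly between the assumed bounds fmin, fmax.
    """
    f0 = np.asarray(f0_hz, dtype=float)
    voiced = f0 > 0
    out = np.zeros_like(f0)
    out[voiced] = (np.log(f0[voiced]) - np.log(fmin)) / (np.log(fmax) - np.log(fmin))
    return np.clip(out, 0.0, 1.0)
```

A discrete representation would instead quantise the contour into pitch classes (e.g. one bin per semitone) and predict them with a classifier, which matches the accuracy-versus-smoothness trade-off the statement describes.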