ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2019.8683323
A Vocoder Based Method for Singing Voice Extraction

Abstract: This paper presents a novel method for extracting the vocal track from a musical mixture. The musical mixture consists of a singing voice and a backing track, which may comprise various instruments. We use a convolutional network with skip and residual connections as well as dilated convolutions to estimate vocoder parameters, given the spectrogram of an input mixture. The estimated parameters are then used to synthesize the vocal track, without any interference from the backing track. We evaluate our system…
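The abstract describes a convolutional network with skip and residual connections and dilated convolutions. As a rough illustration of how such a block operates, here is a minimal NumPy sketch of one dilated-convolution residual block; the shapes, kernel size, activation, and weight layout are all assumptions for illustration, not the paper's actual architecture.

```python
import numpy as np

def dilated_conv1d(x, w, dilation):
    """Causal 1-D dilated convolution (minimal illustrative version).

    x: (in_channels, time); w: (out_channels, in_channels, kernel).
    Left-pads the input so the output has the same length as the input.
    """
    out_ch, in_ch, k = w.shape
    pad = dilation * (k - 1)
    xp = np.pad(x, ((0, 0), (pad, 0)))
    y = np.zeros((out_ch, x.shape[1]))
    for t in range(x.shape[1]):
        # Gather the k taps spaced `dilation` frames apart, ending at time t.
        taps = xp[:, t + pad - dilation * (k - 1) : t + pad + 1 : dilation]
        y[:, t] = np.einsum('oik,ik->o', w, taps)
    return y

def residual_block(x, w_dil, w_res, w_skip, dilation):
    """One block: dilated conv -> tanh -> 1x1 projections for the
    residual path (added back to the input) and the skip path
    (summed across blocks in a full network)."""
    h = np.tanh(dilated_conv1d(x, w_dil, dilation))
    res = np.einsum('oi,it->ot', w_res, h) + x   # residual connection
    skip = np.einsum('oi,it->ot', w_skip, h)     # skip-path contribution
    return res, skip
```

Stacking such blocks with exponentially increasing dilations grows the receptive field over the input spectrogram quickly, which is the usual motivation for dilated convolutions in this setting.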


Cited by 8 publications (11 citation statements). References 13 publications.
“…We also observe that all three vocoder-based models outperform the mask-based model, UNET, on isolating the vocal signal from the mixture. This follows from our previous observation [13] and is due to our approach of re-synthesising the vocal signal from the mixture rather than applying a mask over the input signal. The models still lag behind on audio quality.…”
Section: Subjective Evaluation
confidence: 78%
“…We use this prediction along with the vocoder features to synthesise the audio signal. We tried both the discrete representation of the fundamental frequency described in [23] and a continuous representation, normalised to the range 0 to 1 as used in [13], and found that while the discrete representation leads to slightly higher accuracy in the output, the continuous representation produces a pitch contour perceptually more suitable for synthesis of the signal. Fig.…”
Section: Methods
confidence: 99%
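The citation statement above contrasts a discrete F0 representation with a continuous one normalised to the range 0 to 1. A minimal sketch of such a continuous mapping is shown below; the log-scale mapping, the frequency bounds `fmin`/`fmax`, and the treatment of unvoiced frames are assumptions for illustration, as the cited papers do not specify them here.

```python
import numpy as np

def normalise_f0(f0_hz, fmin=50.0, fmax=500.0):
    """Map an F0 contour in Hz to a continuous [0, 1] representation.

    Unvoiced frames (f0 == 0) map to 0 (an assumed convention); voiced
    frames are scaled log-linearly between the assumed bounds fmin, fmax.
    """
    f0 = np.asarray(f0_hz, dtype=float)
    voiced = f0 > 0
    out = np.zeros_like(f0)
    out[voiced] = (np.log(f0[voiced]) - np.log(fmin)) / (np.log(fmax) - np.log(fmin))
    return np.clip(out, 0.0, 1.0)
```

A discrete representation would instead quantise the contour into pitch classes (e.g. one bin per semitone) and predict them with a classifier, which matches the accuracy-versus-smoothness trade-off the statement describes.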