2017 IEEE 27th International Workshop on Machine Learning for Signal Processing (MLSP)
DOI: 10.1109/mlsp.2017.8168117

A recurrent encoder-decoder approach with skip-filtering connections for monaural singing voice separation

Abstract: The objective of deep learning methods based on encoder-decoder architectures for music source separation is to approximate either ideal time-frequency masks or spectral representations of the target music source(s). The spectral representations are then used to derive time-frequency masks. In this work we introduce a method to directly learn time-frequency masks from an observed mixture magnitude spectrum. We employ recurrent neural networks and train them using prior knowledge only for the magnitude spectrum…
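The core idea in the abstract (learning the mask implicitly by filtering the input mixture, rather than predicting a mask or spectrogram as an explicit target) can be illustrated with a minimal numpy sketch. Shapes, the random stand-in for the network output, and variable names are illustrative assumptions, not the paper's actual model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: T time frames, F frequency bins.
T, F = 4, 6
V_x = np.abs(rng.standard_normal((T, F)))   # observed mixture magnitude spectrum

# Stand-in for the network's non-negative (ReLU-activated) output,
# which plays the role of a time-frequency mask.
mask = np.maximum(rng.standard_normal((T, F)), 0.0)

# Skip-filtering: the mask multiplies the *input* mixture element-wise,
# so a loss on the filtered output trains the mask implicitly --
# no explicit ideal-mask target is needed.
V_hat = mask * V_x
```

Because the loss is computed on `V_hat` against the target source magnitude, the mask is never supervised directly; it emerges from the element-wise product with the input.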

Cited by 25 publications (34 citation statements)
References 21 publications
“…2. The Masker consists of a bi-directional recurrent neural network (Bi-RNN) encoder (RNNenc), an RNN decoder (RNNdec), a sparsifying transform implemented by a feed-forward neural network (FNN) with weights shared through time, followed by a rectified linear unit (ReLU), and the skip-filtering connections [16]. The input to the Masker is the mixture magnitude spectrogram V_x, and the output of the skip-filtering connections is a first estimate of the singing voice spectrogram, denoted V̂_1.…”
Section: MaD TwinNet
confidence: 99%
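The Masker's data flow described in this citation statement might be sketched as follows. This is a toy stand-in under stated assumptions: small illustrative dimensions, a single-direction RNN in place of the Bi-RNN encoder, and random untrained weights:

```python
import numpy as np

rng = np.random.default_rng(1)

T, F, H = 5, 8, 16   # hypothetical: time frames, freq bins, hidden size

V_x = np.abs(rng.standard_normal((T, F)))        # mixture magnitude input

# Toy weights (the actual model's sizes and initialization differ).
W_in = rng.standard_normal((F, H)) * 0.1
W_h = rng.standard_normal((H, H)) * 0.1
W_dec = rng.standard_normal((H, H)) * 0.1
W_fnn = rng.standard_normal((H, F)) * 0.1        # sparsifying FNN, shared over time

def rnn(x, W_x, W_hh):
    """Minimal tanh RNN returning the hidden state at every step."""
    h = np.zeros(W_hh.shape[0])
    out = []
    for t in range(x.shape[0]):
        h = np.tanh(x[t] @ W_x + h @ W_hh)
        out.append(h)
    return np.stack(out)

h_enc = rnn(V_x, W_in, W_h)             # RNNenc (Bi-RNN in the paper)
h_dec = rnn(h_enc, W_dec, W_h)          # RNNdec
mask = np.maximum(h_dec @ W_fnn, 0.0)   # per-frame FNN + ReLU -> non-negative mask
V1 = mask * V_x                         # skip-filtering connections -> V̂_1
```

The key structural point is the last line: the network's output is not `mask` itself but the element-wise product of `mask` with the input `V_x`.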
“…Since V̂_1 is expected to contain interference from other music sources [16,17], the Denoiser aims at further enhancing the estimate of the Masker. A denoising filter is learned and applied to the Masker's estimate, V̂_1.…”
Section: MaD TwinNet
confidence: 99%
“…where ⊙ is the Hadamard product. V̂_1 is expected to contain interference from other music sources [17,18]. Therefore, MaD utilizes another module on top of the Masker, the Denoiser, which consists of two feed-forward layers denoted FNNenc and FNNdec.…”
Section: The Masker and the Denoiser
confidence: 99%
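The Denoiser stage quoted above (two feed-forward layers refining V̂_1 via a learned filter) might be sketched as follows. Dimensions, activations, and the random untrained weights are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

T, F, H = 5, 8, 12                          # hypothetical dimensions
V1 = np.abs(rng.standard_normal((T, F)))    # Masker's first estimate V̂_1

W_enc = rng.standard_normal((F, H)) * 0.1   # FNNenc
W_dec = rng.standard_normal((H, F)) * 0.1   # FNNdec

# The Denoiser learns a second, non-negative filter from V̂_1 and
# applies it back to V̂_1 via an element-wise (Hadamard) product,
# mirroring the skip-filtering idea of the Masker.
filt = np.maximum(np.maximum(V1 @ W_enc, 0.0) @ W_dec, 0.0)
V2 = filt * V1                              # enhanced singing-voice estimate
```

As in the Masker, the supervision falls on the filtered output `V2`, so the denoising filter itself is learned implicitly.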
“…However, state-of-the-art results for source separation are obtained with deep learning methods [15,16], which learn the model from the given data. These methods have proven particularly successful for the task of singing voice separation [17,18,19]. Recently, a DNN-based HPSS method has been introduced [20], based on learning a set of convolution kernels that perform the separation.…”
Section: Introduction
confidence: 99%