Single channel speech music separation using nonnegative matrix factorization and spectral masks

Grais, Emad M.; Erdoğan, Hakan

doi:10.1109/icdsp.2011.6004924

Cited by 71 publications

(53 citation statements)

References 7 publications


“…Weights for basis vectors appear in corresponding columns in matrix W. To approximate data in V as a non-negative linear combination of its component vectors, the non-negative basis vectors in matrix B are optimized. The matrices B and W are estimated by solving the following optimization problem [7]:…”

Section: A Training Of Speech and Music (mentioning)

confidence: 99%

“…With two sets of training data for speech and music signals, the Fast Fourier Transform (FFT) is computed for each signal to obtain the magnitude spectrograms of the speech and music signals. Then, NMF is used to decompose the speech and music spectrograms into basis and weight matrices. In other words, the aim of using NMF is to model the training data as a set of basis vectors that represent the spectral characteristics of each source signal [7].…”

Section: A Training Of Speech and Music (mentioning)

confidence: 99%
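The training step described in the statement above can be sketched as follows. This is a minimal illustration, not the cited authors' implementation: it assumes the generalized KL divergence as the cost and the standard multiplicative updates, it uses a random low-rank matrix in place of a real magnitude spectrogram, and the names `nmf_kl`, `kl_divergence`, and `V_train` are hypothetical.

```python
import numpy as np

def nmf_kl(V, rank, n_iter=200, seed=0, eps=1e-12):
    """Factor a non-negative matrix V (freq x frames) as V ~= B @ W using
    multiplicative updates for the generalized KL divergence (assumed cost)."""
    rng = np.random.default_rng(seed)
    n_freq, n_frames = V.shape
    B = rng.random((n_freq, rank)) + eps    # basis vectors, one per column
    W = rng.random((rank, n_frames)) + eps  # weights, one column per frame
    ones = np.ones_like(V)
    for _ in range(n_iter):
        # Update the weights with the bases held fixed.
        W *= (B.T @ (V / (B @ W + eps))) / (B.T @ ones + eps)
        # Update the bases with the weights held fixed.
        B *= ((V / (B @ W + eps)) @ W.T) / (ones @ W.T + eps)
    return B, W

def kl_divergence(V, V_hat, eps=1e-12):
    """Generalized KL divergence between V and its approximation V_hat."""
    return float(np.sum(V * np.log((V + eps) / (V_hat + eps)) - V + V_hat))

# Toy stand-in for a training magnitude spectrogram (64 bins x 100 frames).
rng = np.random.default_rng(1)
V_train = rng.random((64, 8)) @ rng.random((8, 100))

B_few, W_few = nmf_kl(V_train, rank=8, n_iter=5)
B_many, W_many = nmf_kl(V_train, rank=8, n_iter=200)
err_few = kl_divergence(V_train, B_few @ W_few)
err_many = kl_divergence(V_train, B_many @ W_many)
print(f"KL after 5 iterations: {err_few:.4f}, after 200: {err_many:.4f}")
```

In a real system the same routine would be run once on speech spectrograms and once on music spectrograms, giving one basis matrix per source.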

“…and matrices contain estimations for the magnitude spectral frames of the music and speech signals [7].…”

Section: B Decomposition Of the Mixed Signal (mentioning)

confidence: 99%


Speech/music separation using non-negative matrix factorization with combination of cost functions

Nasersharif

Abdali

2015

2015 the International Symposium on Artificial Intelligence and Signal Processing (AISP)


Non-negative Matrix Factorization (NMF) is one approach to separating speech from music in single-channel source separation. In this approach, the spectrogram of each source signal is factorized as the product of two matrices, known as the basis and weight matrices. To obtain a good estimate of the signal spectrogram, the weight and basis matrices are updated iteratively. A cost function is used to measure the distance between the signal and its estimate. Different cost functions have been introduced based on the Kullback-Leibler (KL) and Itakura-Saito (IS) divergences. The IS divergence is scale-invariant, so it is suitable when the signal coefficients have a large dynamic range, as in short-term music spectra. Based on this property, in this paper we propose to use the IS divergence as the NMF cost function in the training stage for music, and the KL divergence as the NMF cost function in the training stage for speech. Moreover, in the decomposition stage, we propose to use a linear combination of these two divergences together with a regularization term that incorporates temporal continuity as prior knowledge. Experimental results on one hour of speech and music show a good trade-off between the signal-to-interference ratio (SIR) of speech and music in comparison to conventional NMF methods.
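The scale-invariance property mentioned in the abstract can be checked numerically. The sketch below uses the textbook definitions of the generalized KL and IS divergences; the function names `kl_div` and `is_div` and the random test vectors are illustrative assumptions, not taken from the cited paper.

```python
import numpy as np

def kl_div(x, y):
    """Generalized Kullback-Leibler divergence, summed over all elements."""
    return float(np.sum(x * np.log(x / y) - x + y))

def is_div(x, y):
    """Itakura-Saito divergence, summed over all elements."""
    ratio = x / y
    return float(np.sum(ratio - np.log(ratio) - 1.0))

# Two strictly positive toy "spectra" (257 bins each).
rng = np.random.default_rng(0)
x = rng.random(257) + 0.1
y = rng.random(257) + 0.1
scale = 1000.0

# IS depends only on the ratio x / y, so a common scale factor cancels.
print(is_div(x, y), is_div(scale * x, scale * y))
# KL grows linearly with the common scale factor.
print(kl_div(x, y), kl_div(scale * x, scale * y))
```

Because the IS divergence depends only on the ratio of its arguments, low-energy and high-energy bins contribute on an equal footing, whereas the KL divergence weights high-energy bins more heavily.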


“…For comparison to the DNN approach, an equivalent non-negative matrix factorization (NMF) based approach was implemented using the same training and test data (as described above). We used the same unpacking strategy, which has been tested before for NMF-based separation of speech and music [13]. The spectrograms of the training data were sampled and unpacked analogously to the DNN approach, resulting in a large (220500x15000) matrix that was then decomposed using the traditional multiplicative updates algorithm with KL divergence [14].…”

Section: Methods (mentioning)

confidence: 99%

“…For the testing stage, we concatenated both matrices and initialized a corresponding $H_u$ matrix randomly, so that for each unpacked spectrogram, $V_u$, of the set of test songs, $V_u = [W_v \; W_{nv}] H_u$. We then ran the same multiplicative updates algorithm but keeping the composite $W_u$ matrix fixed [13], and updating $H_u$. The test spectrogram was then re-composed for either vocal ($V_v = W_v H_v$) or non-vocal ($V_{nv} = W_{nv} H_{nv}$) vectors, and used to define a soft mask via the element-wise division…”

Section: Methods (mentioning)

confidence: 99%
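The testing stage described in the statement above can be sketched as follows. This is a rough illustration under stated assumptions: the bases `W_v` and `W_nv` are random placeholders rather than trained dictionaries, the KL multiplicative update is assumed, and because the quoted sentence is cut off before the mask formula, the ratio `V_v / (V_v + V_nv)` used here is a common choice and not necessarily the one in the cited paper. The helper name `decompose_fixed_bases` is hypothetical.

```python
import numpy as np

def decompose_fixed_bases(V, W, n_iter=200, seed=0, eps=1e-12):
    """Estimate activations H for V ~= W @ H with W held fixed,
    using the KL multiplicative update for H only."""
    rng = np.random.default_rng(seed)
    H = rng.random((W.shape[1], V.shape[1])) + eps
    denom = W.T @ np.ones_like(V) + eps  # constant because W is fixed
    for _ in range(n_iter):
        H *= (W.T @ (V / (W @ H + eps))) / denom
    return H

rng = np.random.default_rng(0)
n_freq, n_frames, k = 64, 50, 6
W_v = rng.random((n_freq, k))   # placeholder vocal bases (assumed pre-trained)
W_nv = rng.random((n_freq, k))  # placeholder non-vocal bases (assumed pre-trained)

# Toy mixture spectrogram built from both dictionaries.
V_mix = W_v @ rng.random((k, n_frames)) + W_nv @ rng.random((k, n_frames))

W_u = np.hstack([W_v, W_nv])            # composite dictionary, kept fixed
H_u = decompose_fixed_bases(V_mix, W_u)
H_v, H_nv = H_u[:k], H_u[k:]            # rows matching each dictionary

V_v = W_v @ H_v                         # vocal reconstruction
V_nv = W_nv @ H_nv                      # non-vocal reconstruction

# Assumed soft mask: element-wise ratio of the vocal part to the total.
mask_v = V_v / (V_v + V_nv + 1e-12)
vocal_est = mask_v * V_mix
nonvocal_est = (1.0 - mask_v) * V_mix
```

Applying the mask to the mixture, rather than using `V_v` directly, keeps the two estimates summing to the observed mixture spectrogram.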

Deep Karaoke: Extracting Vocals from Musical Mixtures Using a Convolutional Deep Neural Network

Simpson

Roma

Plumbley

2015

Latent Variable Analysis and Signal Separation


Identification and extraction of singing voice from within musical mixtures is a key challenge in source separation and machine audition. Recently, deep neural networks (DNN) have been used to estimate 'ideal' binary masks for carefully controlled cocktail party speech separation problems. However, it is not yet known whether these methods are capable of generalizing to the discrimination of voice and non-voice in the context of musical mixtures. Here, we trained a convolutional DNN (of around a billion parameters) to provide probabilistic estimates of the ideal binary mask for separation of vocal sounds from real-world musical mixtures. We contrast our DNN results with more traditional linear methods. Our approach may be useful for automatic removal of vocal sounds from musical mixtures for 'karaoke' type applications.
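For readers unfamiliar with the ideal binary mask named in the abstract, the sketch below shows the usual definition: a time-frequency bin is assigned to the target source when the target's magnitude exceeds that of the other source. The function name `ideal_binary_mask` and the tiny example spectrograms are illustrative assumptions; the mask is "ideal" because it needs the separate sources, which are only available for training data.

```python
import numpy as np

def ideal_binary_mask(S_target, S_other):
    """Return 1.0 where the target magnitude dominates, else 0.0."""
    return (S_target > S_other).astype(float)

# Tiny made-up magnitude spectrograms (2 frequency bins x 2 frames).
vocal = np.array([[3.0, 0.5],
                  [1.0, 2.0]])
accomp = np.array([[1.0, 2.0],
                   [4.0, 1.0]])

ibm = ideal_binary_mask(vocal, accomp)
mixture = vocal + accomp        # magnitudes assumed additive for illustration
vocal_est = ibm * mixture       # keep only the bins where the vocal dominates
print(ibm)
```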


Tonal Analysis

2022

An Introduction to Audio Content Analysis
