Multi-channel Itakura Saito Distance Minimization with Deep Neural Network

Togami, Masahito

doi:10.1109/icassp.2019.8683410

Cited by 19 publications

(20 citation statements)

References 21 publications


“…The loss function for the DNN training is set to a divergence between two posterior PDFs, i.e., the posterior PDF estimated by [12] is utilized similarly to conventional supervised speech source separation [13,24], Π is a set of possible permutations, and…”

Section: Loss Function For Deep Neural Network Training (mentioning)

confidence: 99%

Unsupervised Training for Deep Speech Source Separation with Kullback-Leibler Divergence Based Probabilistic Loss Function

Togami, Masuyama, Komatsu et al. 2020

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Self Cite


In this paper, we propose a multi-channel speech source separation method with a deep neural network (DNN) that is trained under the condition that no clean signal is available. As an alternative to a clean signal, the proposed method adopts a speech signal estimated by unsupervised speech source separation with a statistical model. As a statistical model of the microphone input signal, we adopt a time-varying spatial covariance matrix (SCM) model which includes reverberation and background noise submodels so as to achieve robustness against reverberation and background noise. The DNN infers intermediate variables which are needed for constructing the time-varying SCM. Speech source separation is performed in a probabilistic manner so as to avoid overfitting to separation error. Since there are multiple intermediate variables, a loss function which evaluates a single intermediate variable is not applicable. Instead, the proposed method adopts a loss function which evaluates the output probabilistic signal directly based on the Kullback-Leibler divergence (KLD). The gradient of the loss function can be back-propagated into the DNN through all the intermediate variables. Experimental results under reverberant conditions show that the proposed method can train the DNN efficiently even when the number of training utterances is small, i.e., 1K.
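The citation statement for this entry refers to a set Π of possible permutations over which the loss is evaluated. As a minimal sketch of that general idea (a generic permutation-invariant minimum, not the authors' implementation), the following assumes a precomputed table `pair_loss[i][j]` holding the loss of matching estimated source `i` to reference source `j`. The function name `permutation_invariant_loss` and the example values are hypothetical.

```python
from itertools import permutations


def permutation_invariant_loss(pair_loss):
    """Return (minimum total loss, best permutation).

    pair_loss[i][j] is an assumed, precomputed loss for matching
    estimated source i to reference source j. Every permutation of
    the references is tried and the cheapest assignment is kept.
    """
    n = len(pair_loss)
    best_total = float("inf")
    best_perm = None
    for perm in permutations(range(n)):
        # Sum the loss of the assignment i -> perm[i].
        total = sum(pair_loss[i][perm[i]] for i in range(n))
        if total < best_total:
            best_total, best_perm = total, perm
    return best_total, best_perm


# Hypothetical 2-source example: swapping the sources is cheaper.
example = [[4.0, 1.0],
           [2.0, 5.0]]
best_total, best_perm = permutation_invariant_loss(example)
print(best_total, best_perm)
```

The exhaustive search costs N! evaluations, which is only practical for a small number of sources.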



“…In Section 3.1, we extend WA for applying it to our proposed system, which is defined in the time-domain. Section 3.2 describes another objective function calculated by a sum of the original multi-channel objective function [14] and a consistency-aware objective function defined in the T-F domain. Both proposed objective functions are summarized in Fig.…”

Section: Proposed Multi-channel Speech Enhancement System (mentioning)

confidence: 99%

Consistency-Aware Multi-Channel Speech Enhancement Using Deep Neural Networks

Masuyama, Togami, Komatsu 2020

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Self Cite


This paper proposes a deep neural network (DNN)-based multi-channel speech enhancement system in which a DNN is trained to maximize the quality of the enhanced time-domain signal. DNN-based multi-channel speech enhancement is often conducted in the time-frequency (T-F) domain because spatial filtering can be efficiently implemented in the T-F domain. In such a case, ordinary objective functions are computed on the estimated T-F mask or spectrogram. However, the estimated spectrogram is often inconsistent, and its amplitude and phase may change when the spectrogram is converted back to the time-domain. That is, the objective function does not evaluate the enhanced time-domain signal properly. To address this problem, we propose to use an objective function defined on the reconstructed time-domain signal. Specifically, speech enhancement is conducted by multi-channel Wiener filtering in the T-F domain, and its result is converted back to the time-domain. We propose two objective functions computed on the reconstructed signal, where the first one is defined in the time-domain and the other one is defined in the T-F domain. Our experiment demonstrates the effectiveness of the proposed system compared to T-F masking and mask-based beamforming.


“…1. After reviewing a multichannel loss function for time-varying MWF [28], the proposed time-invariant mask-based beamforming is introduced, which is based on the same loss function used in [28]. Since the loss function focuses on the time-varying MWF, it requires the estimated time-varying activation which is redundant for time-invariant beamforming.…”

Section: Proposed Mask-based Beamforming With Multichannel Loss Function (mentioning)

confidence: 99%

“…For time-varying MWF, we proposed a multichannel loss function which evaluates the estimated time-varying spatial covariance matrices $\hat{R}_{t,f,n}$. In [28], a DNN estimates the time-varying activation and TF-mask. Based on DNN's outputs, the time-varying spatial covariance matrices are calculated as $\hat{R}_{t,f,n} = v_{t,f,n} R_{f,n}$, where $R_{f,n}$ is given by Eq.…”

Section: Multichannel Loss Function For Time-varying MWF [28] (mentioning)

confidence: 99%

“…We first import it for time-invariant mask-based beamforming. Furthermore, since the loss function presented in [28] is redundant for the time-invariant mask-based beamforming, we also propose the mask-based beamforming with the low-computational loss function. By using PIT, both proposed methods can be easily applied to speaker-independent multi-talker separation.…”

Section: Introduction (mentioning)

confidence: 99%


Multichannel Loss Function for Supervised Speech Source Separation by Mask-Based Beamforming

2019

Self Cite


In this paper, we propose two mask-based beamforming methods using a deep neural network (DNN) trained by multichannel loss functions. Beamforming techniques using time-frequency (TF) masks estimated by a DNN have been applied to many applications, where the TF masks are used for estimating spatial covariance matrices. To train a DNN for mask-based beamforming, loss functions designed for monaural speech enhancement/separation have been employed. Although such a training criterion is simple, it does not directly correspond to the performance of mask-based beamforming. To overcome this problem, we use multichannel loss functions which evaluate the estimated spatial covariance matrices based on the multichannel Itakura-Saito divergence. DNNs trained by the multichannel loss functions can be applied to construct several beamformers. Experimental results confirmed their effectiveness and robustness to microphone configurations.
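For orientation, the multichannel Itakura-Saito divergence named in this abstract is commonly defined, for two M × M Hermitian positive definite matrices, as shown below. This is the standard textbook form, given here as a sketch. The symbols $R$ (target matrix), $\hat{R}$ (estimated spatial covariance matrix), and $M$ (number of microphones) are assumptions of this note, and the excerpts do not state how the cited papers choose the target matrix or sum over time, frequency, and source indices.

```latex
D_{\mathrm{IS}}\!\left(R, \hat{R}\right)
  = \operatorname{tr}\!\left(R \hat{R}^{-1}\right)
  - \log \det\!\left(R \hat{R}^{-1}\right)
  - M
```

For M = 1 this reduces to the scalar Itakura-Saito divergence $r/\hat{r} - \log(r/\hat{r}) - 1$, which is zero exactly when $r = \hat{r}$.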
