Semi-supervised learning with deep neural networks for relative transfer function inverse regression
Abstract: Prior knowledge of the relative transfer function (RTF) is useful in many applications but remains little studied. In this paper, we propose a semi-supervised learning algorithm based on deep neural networks (DNNs) for RTF inverse regression, that is, to generate the full-band RTF vector directly from the source-receiver pose (position and orientation). Two typical scenarios are discussed: training on labele…
“…The phase features are closely related to IPD, which are thus further activated by sine and cosine functions as is done in Eq. (10). The phase branch is also followed by a convolutional layer with 64 3 × 3 kernels, a BN and a ReLU activation function.…”
Section: DP-RTF Learning
Citation type: mentioning (confidence: 99%)
“…The complementarity of the two types of difference features contributes the fusion of time and intensity difference information. A typical fused feature is relative transfer function (RTF) [9], [10] which encodes time and intensity difference in its argument and magnitude respectively.…”
Direct-path relative transfer function (DP-RTF) refers to the ratio between the direct-path acoustic transfer functions of two microphone channels. Though DP-RTF fully encodes the sound spatial cues and serves as a reliable localization feature, it is often erroneously estimated in the presence of noise and reverberation. This paper proposes to learn DP-RTF with deep neural networks for robust binaural sound source localization. A DP-RTF learning network is designed to regress the binaural sensor signals to a real-valued representation of DP-RTF. It consists of a branched convolutional neural network module to separately extract the inter-channel magnitude and phase patterns, and a convolutional recurrent neural network module for joint feature learning. To better explore the speech spectra to aid the DP-RTF estimation, a monaural speech enhancement network is used to recover the direct-path spectrograms from the noisy ones. The enhanced spectrograms are stacked onto the noisy spectrograms to act as the input of the DP-RTF learning network. We train a single DP-RTF learning network using many different binaural arrays to enable the generalization of DP-RTF learning across arrays. This avoids time-consuming training data collection and network retraining for a new array, which is very useful in practical applications. Experimental results on both simulated and real-world data show the effectiveness of the proposed method for direction of arrival (DOA) estimation in noisy and reverberant environments, and a good generalization ability to unseen binaural arrays.
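As a rough illustration of why the DP-RTF carries both time and intensity cues, the sketch below builds the ratio for an idealized free-field model in which channel 2 is a delayed and scaled copy of channel 1. The function name `free_field_dp_rtf`, the delay, and the gain value are assumptions chosen for illustration; they are not taken from the cited paper.

```python
import numpy as np

def free_field_dp_rtf(freqs_hz, delay_s, gain_ratio):
    """DP-RTF of channel 2 relative to channel 1 under an assumed
    pure delay-and-gain direct-path model (hypothetical helper)."""
    return gain_ratio * np.exp(-2j * np.pi * freqs_hz * delay_s)

freqs = np.array([250.0, 500.0, 1000.0])        # frequency bins in Hz
rtf = free_field_dp_rtf(freqs, delay_s=2.5e-4, gain_ratio=0.8)

# Magnitude carries the inter-channel level difference (here in dB).
level_diff_db = 20.0 * np.log10(np.abs(rtf))
# Argument carries the inter-channel phase difference, i.e. the delay.
phase_diff = np.angle(rtf)
recovered_delay = -phase_diff / (2.0 * np.pi * freqs)
```

In this toy model the recovered delay equals the delay used to build the ratio only because the phase stays within one period at the chosen frequencies; at higher frequencies the phase wraps and the delay is no longer unique per bin.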
“…Moreover, inter-channel intensity difference (IID) is computed as the energy ratio of the signals captured by two microphones. Relative transfer function (RTF) [10,11] encodes time and intensity information in its argument and magnitude respectively, which is the ratio between the acoustic transfer functions of the two channels. Other high-level localization features include the cross-correlation function (CCF) [3], the eigen vectors of spatial correlation matrix associated with signal subspace [12], and so forth.…”
This article proposes a deep neural network (DNN)-based direct-path relative transfer function (DP-RTF) enhancement method for robust direction of arrival (DOA) estimation in noisy and reverberant environments. The DP-RTF refers to the ratio between the direct-path acoustic transfer functions of the two microphone channels. First, the complex-valued DP-RTF is decomposed into the inter-channel intensity difference and sinusoidal functions of the inter-channel phase difference in the time-frequency domain. Then, the decomposed DP-RTF features from a series of temporal context frames are used to train a DNN model, which maps the DP-RTF features contaminated by noise and reverberation to the clean ones and also provides a time-frequency (TF) weight to indicate the reliability of the mapping. Finally, the DOA of a sound source is estimated by integrating the weighted matching between the enhanced DP-RTF features and the DP-RTF templates. Experimental results on simulated data show the superiority of the proposed DP-RTF enhancement network for estimating the DOA of the sound source in environments with various levels of noise and reverberation. This is an open access article under the terms of the Creative Commons Attribution-NonCommercial-NoDerivs License, which permits use and distribution in any medium, provided the original work is properly cited, the use is non-commercial and no modifications or adaptations are made.
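The decomposition described in this abstract can be sketched as follows. This is a minimal example, assuming the intensity difference is expressed in dB and the features are stacked along the last axis; the function names and feature layout are hypothetical and the exact scaling used in the cited article may differ.

```python
import numpy as np

def dp_rtf_to_real_features(rtf, eps=1e-12):
    """Split a complex DP-RTF into [IID in dB, sin(IPD), cos(IPD)].
    The dB scaling and the feature order are assumptions."""
    iid_db = 20.0 * np.log10(np.abs(rtf) + eps)
    ipd = np.angle(rtf)
    return np.stack([iid_db, np.sin(ipd), np.cos(ipd)], axis=-1)

def real_features_to_dp_rtf(features):
    """Inverse mapping: rebuild the complex DP-RTF from the real features."""
    iid_db, sin_ipd, cos_ipd = np.moveaxis(features, -1, 0)
    return 10.0 ** (iid_db / 20.0) * (cos_ipd + 1j * sin_ipd)

rtf = np.array([0.8 * np.exp(-0.4j), 1.2 * np.exp(2.9j), 0.5 * np.exp(-3.0j)])
features = dp_rtf_to_real_features(rtf)       # one row per frequency bin
rebuilt = real_features_to_dp_rtf(features)
```

Using the sine and cosine of the phase difference instead of the raw angle avoids the discontinuity at ±π, so phases near the wrap point map to nearby feature values.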
“…Under the dual-stage localization framework, deep neural network (DNN) can be used to either extract localization features [3,4], or build the mapping from the localization features to source location [5,6]. Commonly used localization feature includes inter-channel time difference (ITD) [7], inter-channel phase difference (IPD) [8], inter-channel intensity difference (IID), relative transfer function (RTF) [9,10], etc. The source can be easily localized with aforementioned localization features under a noise-free and anechoic condition.…”
Direct-path relative transfer function (DP-RTF) refers to the ratio between the direct-path acoustic transfer functions of two channels. Though DP-RTF fully encodes the sound directional cues and serves as a reliable localization feature, it is often erroneously estimated in the presence of noise and reverberation. This paper proposes a supervised DP-RTF learning method with deep neural networks for robust binaural sound source localization. To exploit the complementarity of single-channel spectrogram and dual-channel difference information, we first recover the direct-path magnitude spectrogram from the contaminated one using a monaural enhancement network, and then predict the DP-RTF from the dual-channel (enhanced-) intensity and phase cues using a binaural enhancement network. In addition, a weighted-matching softmax training loss is designed to encourage the predicted DP-RTFs to be concentrated for the same direction and separated for different directions. Finally, the direction of arrival (DOA) of the source is estimated by matching the predicted DP-RTF with the ground truths of candidate directions. Experimental results show the superiority of our method for DOA estimation in environments with various levels of noise and reverberation.
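The final matching step can be illustrated with a small sketch. It assumes phase-only templates from a free-field two-microphone model and cosine similarity as the matching score; the helper names, the 0.16 m spacing, and the candidate grid are illustrative assumptions, and the paper's own templates and matching rule may differ.

```python
import numpy as np

def phase_template(freqs_hz, azimuth_deg, mic_distance_m=0.16, speed_of_sound=343.0):
    """Hypothetical free-field template: [sin(IPD), cos(IPD)] per frequency bin."""
    delay = mic_distance_m * np.sin(np.deg2rad(azimuth_deg)) / speed_of_sound
    ipd = -2.0 * np.pi * freqs_hz * delay
    return np.stack([np.sin(ipd), np.cos(ipd)], axis=-1)

def estimate_doa(predicted, templates, candidate_doas_deg):
    """Return the candidate whose template has the highest cosine similarity."""
    flat_pred = predicted.ravel()
    flat_tmpl = templates.reshape(len(templates), -1)
    norms = np.linalg.norm(flat_tmpl, axis=1) * np.linalg.norm(flat_pred) + 1e-12
    scores = flat_tmpl @ flat_pred / norms
    return candidate_doas_deg[int(np.argmax(scores))], scores

freqs = np.linspace(200.0, 2000.0, 64)
candidates = np.arange(-90, 91, 10)                       # azimuth grid in degrees
templates = np.stack([phase_template(freqs, a) for a in candidates])

# Stand-in for a network prediction: the 30-degree template with a small offset.
predicted = phase_template(freqs, 30.0) + 0.01
doa, scores = estimate_doa(predicted, templates, candidates)
```

Scoring over all frequency bins jointly is what lets the match stay unambiguous even when the phase of an individual high-frequency bin wraps.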