Multichannel Speech Separation with Recurrent Neural Networks from High-Order Ambisonics Recordings

Perotin, Laureline; Serizel, Romain; Guérin, Alexandre

doi:10.1109/icassp.2018.8461370

Cited by 35 publications

(21 citation statements)

References 20 publications

(29 reference statements)

“…In this context, it is important to know the directions of arrival (DoAs) of the sounds, in order either to enhance the signals of interest or to reproduce the sound scene properly. For instance, DoA estimation is essential for speech enhancement and robust far-field automatic speech recognition in scenarios involving overlapping speakers [1]-[5].…”

Section: Introduction (mentioning)

confidence: 99%

CRNN-Based Multiple DoA Estimation Using Acoustic Intensity Features for Ambisonics Recordings

Perotin

Serizel

Vincent

et al. 2019

IEEE J. Sel. Top. Signal Process.

Self Cite

117

126

Localizing audio sources is challenging in real reverberant environments, especially when several sources are active. We propose to use a neural network built from stacked convolutional and recurrent layers in order to estimate the directions of arrival of multiple sources from a first-order Ambisonics recording. It returns the directions of arrival over a discrete grid of a known number of sources. We propose to use features derived from the acoustic intensity vector as inputs. We analyze the behavior of the neural network by means of a visualization technique called layerwise relevance propagation. This analysis highlights which parts of the input signal are relevant in a given situation. We also conduct experiments to evaluate the performance of our system in various environments, from simulated rooms to real recordings, with one or two speech sources. The results show that the proposed features significantly improve performances with respect to raw Ambisonics inputs.
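The intensity-based input features mentioned in this abstract can be illustrated with a minimal sketch. It assumes a first-order Ambisonics STFT with channels ordered W, X, Y, Z and an energy normalization chosen here for illustration; the function names, channel ordering, and normalization are assumptions, not details confirmed by the abstract.

```python
import numpy as np


def intensity_features(foa_stft, eps=1e-12):
    """Active and reactive intensity features from a first-order Ambisonics STFT.

    foa_stft: complex array of shape (4, F, T), channels assumed ordered W, X, Y, Z.
    Returns a real array of shape (6, F, T): 3 active + 3 reactive components.
    """
    w, xyz = foa_stft[0], foa_stft[1:4]
    # Cross-spectrum between the omnidirectional channel and the three dipoles.
    cross = np.conj(w)[None] * xyz
    # Per-bin energy used for normalization (an illustrative choice).
    energy = np.abs(w) ** 2 + np.sum(np.abs(xyz) ** 2, axis=0) / 3.0
    active = cross.real / (energy + eps)
    reactive = cross.imag / (energy + eps)
    return np.concatenate([active, reactive], axis=0)


def plane_wave_foa(s, azimuth, elevation):
    """Hypothetical helper: encode an STFT `s` of shape (F, T) as a single plane wave."""
    direction = np.array([
        np.cos(azimuth) * np.cos(elevation),
        np.sin(azimuth) * np.cos(elevation),
        np.sin(elevation),
    ])
    foa = np.stack([s] + [d * s for d in direction])
    return foa, direction


rng = np.random.default_rng(0)
s = rng.standard_normal((5, 7)) + 1j * rng.standard_normal((5, 7))
foa, direction = plane_wave_foa(s, azimuth=0.7, elevation=0.2)
feats = intensity_features(foa)
```

For a single anechoic plane wave, the active part points along the source direction and the reactive part vanishes; in a reverberant room both parts carry information, which is why a network might take all six channels as input.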

“…Heymann et al predicted TF masks out of a single signal of the microphone array [16]. Perotin et al [22] or Chakrabarty and Habets [21] included several other signals to improve the speech recognition or speech enhancement performance. We propose to extend these scenarios to the multi-node context of DANSE.…”

Section: Deep Neural Network Based Distributed Multichannel Wiener Fi… (mentioning)

confidence: 99%

“…This yields better results than single-channel prediction but combining all the sensor signals is not scalable and seems suboptimal because of the redundancy of the data. Coping with the redundancy, Perotin et al [22] combined a single estimate of the source signals with the input mixture and used the resulting tensor to train a long short-term memory (LSTM) recurrent neural network (RNN).…”

Section: Introduction (mentioning)

confidence: 99%

“…This allows for using the MWF-based DANSE algorithm which was reported to achieve good speech enhancement performance [9]. Following the results shown by Perotin et al [22], we take advantage of the DANSE paradigm [9] by combining at each node one local signal with the estimations of the target signal sent by the other nodes. This uses the multichannel context for the mask estimation but avoids the redundancy brought by the signals of a same node.…”

Section: Introduction (mentioning)

confidence: 99%

DNN-based Distributed Multichannel Mask Estimation for Speech Enhancement in Microphone Arrays

Furnon

Serizel

Illina

et al. 2020

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Self Cite

Multichannel processing is widely used for speech enhancement but several limitations appear when trying to deploy these solutions in the real world. Distributed sensor arrays that consider several devices with a few microphones is a viable solution which allows for exploiting the multiple devices equipped with microphones that we are using in our everyday life. In this context, we propose to extend the distributed adaptive node-specific signal estimation approach to a neural network framework. At each node, a local filtering is performed to send one signal to the other nodes where a mask is estimated by a neural network in order to compute a global multichannel Wiener filter. In an array of two nodes, we show that this additional signal can be leveraged to predict the masks and leads to better speech enhancement performance than when the mask estimation relies only on the local signals.
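The idea of using an estimated mask to compute a multichannel Wiener filter can be sketched as follows. This is a generic single-node, mask-based filter under stated assumptions, not the distributed algorithm of the paper: the mask is taken as given (in practice a neural network would predict it), and the function name, covariance normalization, and regularization constant are illustrative choices.

```python
import numpy as np


def mask_based_mwf(mix_stft, speech_mask, ref=0, eps=1e-8):
    """Mask-based multichannel Wiener filter (generic sketch).

    mix_stft: complex array (C, F, T), the multichannel mixture STFT.
    speech_mask: real array (F, T) with values in [0, 1].
    Returns the filtered STFT (F, T), an estimate of speech at channel `ref`.
    """
    n_ch, n_freq, _ = mix_stft.shape
    out = np.empty(mix_stft.shape[1:], dtype=complex)
    for f in range(n_freq):
        y = mix_stft[:, f, :]   # (C, T) frames of this frequency bin
        m = speech_mask[f]      # (T,) speech presence weights
        # Mask-weighted spatial covariance matrices of speech and noise.
        r_s = (m * y) @ y.conj().T / (m.sum() + eps)
        r_n = ((1.0 - m) * y) @ y.conj().T / ((1.0 - m).sum() + eps)
        # Filter w = (R_s + R_n)^{-1} R_s e_ref, with light diagonal loading.
        w = np.linalg.solve(r_s + r_n + eps * np.eye(n_ch), r_s[:, ref])
        out[f] = w.conj() @ y   # apply w^H to every frame
    return out


rng = np.random.default_rng(0)
mix = rng.standard_normal((2, 4, 50)) + 1j * rng.standard_normal((2, 4, 50))
# With an all-ones mask everything is treated as speech, so the filter
# reduces to (approximately) selecting the reference channel.
passthrough = mask_based_mwf(mix, np.ones((4, 50)))
```

In a distributed setting such as the one described above, each node would run a filter of this kind on its local signals plus the signals received from other nodes; that exchange is not modeled here.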

“…Recently, speech enhancement is advanced by the use of a deep neural network (DNN) to estimate a T-F mask. For effectively modelling a speech signal which is time-sequential data, a recurrent neural network (RNN) is used in various speech signal processing applications [1][2][3][4][5][6][7][8][9][10][11][12][13][14].…”

Section: Introduction (mentioning)

confidence: 99%

Real-Time Speech Enhancement Using Equilibriated RNN

Takeuchi

Yatabe

Oikawa

et al. 2020

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

We propose a speech enhancement method using a causal deep neural network (DNN) for real-time applications. DNN has been widely used for estimating a time-frequency (T-F) mask which enhances a speech signal. One popular DNN structure for that is a recurrent neural network (RNN) owing to its capability of effectively modelling time-sequential data like speech. In particular, the long short-term memory (LSTM) is often used to alleviate the vanishing/exploding gradient problem which makes the training of an RNN difficult. However, the number of parameters of LSTM is increased as the price of mitigating the difficulty of training, which requires more computational resources. For real-time speech enhancement, it is preferable to use a smaller network without losing the performance. In this paper, we propose to use the equilibriated recurrent neural network (ERNN) for avoiding the vanishing/exploding gradient problem without increasing the number of parameters. The proposed structure is causal, which requires only the information from the past, in order to apply it in real-time. Compared to the uni- and bi-directional LSTM networks, the proposed method achieved the similar performance with much fewer parameters. Index Terms: real-time speech enhancement, equilibriated recurrent neural network, vanishing/exploding gradient problem.
