Multi-Speaker DOA Estimation Using Deep Convolutional Networks Trained With Noise Signals

Chakrabarty, Soumitro; Habets, Emanuël A. P.

doi:10.1109/jstsp.2019.2901664

Cited by 266 publications

(244 citation statements)

References 29 publications

Citation types: Supporting, Mentioning (241), Contrasting, Unclassified


“…Commonly used input features that have been used for deep learning based localization include phase spectrum [115], magnitude spectrum [118], and generalized cross-correlation between channels [117]. In general, source localization requires the use of interchannel information, which can also be learned by a deep neural network with a suitable topology from within-channel features, for example by convolutional layers [118] where the kernels span multiple channels.…”

Section: Applications (mentioning)

confidence: 99%
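
To illustrate one of the input features named in the statement above, generalized cross-correlation between channels, here is a minimal sketch of GCC with phase-transform (PHAT) weighting for a single microphone pair. It is not taken from any of the cited papers. The function name `gcc_phat`, the `max_lag` parameter, and the small regularization constant are assumptions made for this example.

```python
import numpy as np


def gcc_phat(ref, sig, max_lag):
    """Estimate the delay (in samples) of `sig` relative to `ref`.

    Hypothetical helper for illustration only; returns the estimated lag
    and the GCC-PHAT values for lags -max_lag..+max_lag.
    """
    n = len(ref) + len(sig)  # zero-pad so the correlation is not circular
    ref_spec = np.fft.rfft(ref, n=n)
    sig_spec = np.fft.rfft(sig, n=n)

    # Cross-power spectrum, then PHAT weighting: keep phase, drop magnitude.
    cross = sig_spec * np.conj(ref_spec)
    cross /= np.abs(cross) + 1e-12  # assumed constant to avoid division by zero

    cc = np.fft.irfft(cross, n=n)
    # Reorder so that index 0 corresponds to lag -max_lag.
    cc = np.concatenate((cc[n - max_lag:], cc[:max_lag + 1]))
    lag = int(np.argmax(cc)) - max_lag
    return lag, cc


# Usage: white noise and a copy of it delayed by 5 samples.
rng = np.random.default_rng(0)
mic1 = rng.standard_normal(1024)
mic2 = np.concatenate((np.zeros(5), mic1[:-5]))
estimated_lag, gcc = gcc_phat(mic1, mic2, max_lag=16)
```

In a learning-based localizer, the vector `gcc` (or a stack of such vectors over microphone pairs) would serve as the network input, whereas phase-based approaches feed the per-channel STFT phase directly and let convolutional kernels spanning the channels learn the interchannel relations.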

Deep Learning for Audio Signal Processing

Purwins, Virtanen, et al. 2019

IEEE J. Sel. Top. Signal Process.


Given the recent surge in developments of deep learning, this article provides a review of the state-of-the-art deep learning techniques for audio signal processing. Speech, music, and environmental sound processing are considered side-by-side, in order to point out similarities and differences between the domains, highlighting general methods, problems, key references, and potential for cross-fertilization between areas. The dominant feature representations (in particular, log-mel spectra and raw waveform) and deep learning models are reviewed, including convolutional neural networks, variants of the long short-term memory architecture, as well as more audio-specific neural network models. Subsequently, prominent deep learning application areas are covered, i.e. audio recognition (automatic speech recognition, music information retrieval, environmental sound detection, localization and tracking) and synthesis and transformation (source separation, audio enhancement, generative models for speech, sound, and music synthesis). Finally, key issues and future questions regarding deep learning applied to audio signal processing are identified.


“…The performance of the proposed algorithm is compared with a recent CNN-based DOA estimation method proposed in [54] (subsequently denoted as "CNN-PH") where it was already shown that "CNN-PH" outperforms conventional parametric methods like MUSIC and SRP-PHAT. For a fair comparison, we kept the CNN architecture and other evaluation criteria same in all possible ways.…”

Section: B Baseline Methods and Evaluation Metrics (mentioning)

confidence: 99%

“…On the contrary, Adavanne et al considered both magnitude and phase information of the STFT coefficients and used consecutive time frames to form the feature snapshot to train a convolutional recurrent neural network (CRNN) and performed a joint sound event detection and localization [55]. Both [54] and [55] require the model to be trained for unique combinations of sound sources from different angles in order to accurately estimate the DOA of simultaneously active multiple sound sources.…”

Section: A Literature Review (mentioning)

confidence: 99%

Multi-Source DOA Estimation Through Pattern Recognition of the Modal Coherence of a Reverberant Soundfield

Fahim, Samarasinghe, Abhayapala 2020

IEEE/ACM Trans. Audio Speech Lang. Process.


We propose a novel multi-source direction of arrival (DOA) estimation technique using a convolutional neural network algorithm which learns the modal coherence patterns of an incident soundfield through measured spherical harmonic coefficients. We train our model for individual time-frequency bins in the short-time Fourier transform spectrum by analyzing the unique snapshot of modal coherence for each desired direction. The proposed method is capable of estimating simultaneously active multiple sound sources on a 3D space using a single-source training scheme. This single-source training scheme reduces the training time and resource requirements as well as allows the reuse of the same trained model for different multi-source combinations. The method is evaluated against various simulated and practical noisy and reverberant environments with varying acoustic criteria and found to outperform the baseline methods in terms of DOA estimation accuracy. Furthermore, the proposed algorithm allows independent training of azimuth and elevation during a full DOA estimation over 3D space which significantly improves its training efficiency without affecting the overall estimation accuracy.


“…Binaural cues are employed in [7], where the cross-correlation function (CCF) was used as features in a DNN to estimate the azimuth of a sound source with simulated head movement. CNN architectures were also used in [8,9] using frequency-domain features such as the phase or the magnitude of the signal.…”

Section: Introduction (mentioning)

confidence: 99%

End-to-end Binaural Sound Localisation from the Raw Waveform

Vecchiotti, Squartini, et al. 2019

ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)


A novel end-to-end binaural sound localisation approach is proposed which estimates the azimuth of a sound source directly from the waveform. Instead of employing hand-crafted features commonly employed for binaural sound localisation, such as the interaural time and level difference, our end-to-end system approach uses a convolutional neural network (CNN) to extract specific features from the waveform that are suitable for localisation. Two systems are proposed which differ in the initial frequency analysis stage. The first system is auditory-inspired and makes use of a gammatone filtering layer, while the second system is fully data-driven and exploits a trainable convolutional layer to perform frequency analysis. In both systems, a set of dedicated convolutional kernels are then employed to search for specific localisation cues, which are coupled with a localisation stage using fully connected layers. Localisation experiments using binaural simulation in both anechoic and reverberant environments show that the proposed systems outperform a state-of-the-art deep neural network system. Furthermore, our investigation of the frequency analysis stage in the second system suggests that the CNN is able to exploit different frequency bands for localisation according to the characteristics of the reverberant environment.
