A neural network based algorithm for speaker localization in a multi-room environment

Vesperini, Fabio; Vecchiotti, Paolo; Principi, Emanuele; Squartini, Stefano; Piazza, Francesco

doi:10.1109/mlsp.2016.7738817

Cited by 64 publications

(63 citation statements)

References 21 publications


“…Further, methods [4,18,20,25] proposed to simultaneously detect DOAs of overlapping sound events by estimating the number of active sources from the data itself. Most methods used a classification approach, thereby estimating the source presence likelihood at a fixed set of angles, while [22,23] used a regression approach and let the DNN produce continuous output.…”

Section: B Sound Source Localization

mentioning

confidence: 99%

“…Spectral power azi (Full) for each class Multiple CNN Circular
Yiwere et al [21] | ILD, cross-correlation | azi and dist | 1 | FC | Binaural | ×
Ferguson et al [22] | GCC, cepstrogram | azi and dist (regression) | 1 | CNN | Linear | ×
Vesperini et al [23] | GCC | x and y (regression) | 1 | FC | Distributed | ×
Sun et al [24] | GCC | azi and ele | 1 | PNN | Cartesian | ×
Adavanne et al [25] | Phase and magnitude spectrum | azi and ele (Full) | Multiple | CRNN | Generic | ×…”

mentioning

confidence: 99%


Sound Event Localization and Detection of Overlapping Sources Using Convolutional Recurrent Neural Networks

Adavanne,

Politis,

Nikunen

et al. 2019

IEEE J. Sel. Top. Signal Process.

378


In this paper, we propose a convolutional recurrent neural network for joint sound event localization and detection (SELD) of multiple overlapping sound events in three-dimensional (3D) space. The proposed network takes a sequence of consecutive spectrogram time-frames as input and maps it to two outputs in parallel. As the first output, the sound event detection (SED) is performed as a multi-label classification task on each time-frame producing temporal activity for all the sound event classes. As the second output, localization is performed by estimating the 3D Cartesian coordinates of the direction-of-arrival (DOA) for each sound event class using multi-output regression. The proposed method is able to associate multiple DOAs with respective sound event labels and further track this association with respect to time. The proposed method uses separately the phase and magnitude component of the spectrogram calculated on each audio channel as the feature, thereby avoiding any method- and array-specific feature extraction. The method is evaluated on five Ambisonic and two circular array format datasets with different overlapping sound events in anechoic, reverberant and real-life scenarios. The proposed method is compared with two SED, three DOA estimation, and one SELD baselines. The results show that the proposed method is generic and applicable to any array structures, robust to unseen DOA values, reverberation, and low SNR scenarios. The proposed method achieved a consistently higher recall of the estimated number of DOAs across datasets in comparison to the best baseline. Additionally, this recall was observed to be significantly better than the best baseline method for a higher number of overlapping sound events.
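The abstract above describes localization as regression onto the 3D Cartesian coordinates of the DOA. As a rough illustration only, the sketch below converts an azimuth/elevation pair to a unit vector and back. The axis convention (x forward, z up, angles in degrees) and the function names are assumptions made for this example and are not taken from the paper.

```python
import math


def doa_to_cartesian(azimuth_deg, elevation_deg):
    """Map a direction (azimuth, elevation) to a unit vector (x, y, z).

    Assumed convention: azimuth measured in the x-y plane from the x axis,
    elevation measured from the x-y plane towards the z axis.
    """
    azi = math.radians(azimuth_deg)
    ele = math.radians(elevation_deg)
    x = math.cos(ele) * math.cos(azi)
    y = math.cos(ele) * math.sin(azi)
    z = math.sin(ele)
    return (x, y, z)


def cartesian_to_doa(x, y, z):
    """Recover (azimuth, elevation) in degrees from a Cartesian estimate.

    The input does not need to be unit length, which is convenient when
    the vector comes from an unconstrained regression output.
    """
    azimuth_deg = math.degrees(math.atan2(y, x))
    elevation_deg = math.degrees(math.atan2(z, math.hypot(x, y)))
    return (azimuth_deg, elevation_deg)


if __name__ == "__main__":
    target = doa_to_cartesian(30.0, 45.0)
    print("regression target:", target)
    print("decoded direction:", cartesian_to_doa(*target))
```

One reason such a target can be attractive is that it has no wrap-around discontinuity at ±180° azimuth, unlike a raw angle.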


“…By combining information from multiple microphone arrays, directions can be merged to obtain source locations. Given a microphone array signal from multiple microphones, direction estimation can be formulated in two ways: 1) by forming a fixed grid of possible directions, and by using multilabel classification to predict if there is an active source in a specific direction [115], or 2) by using regression to predict the directions [116] or spatial coordinates [117] of target sources. In addition to this categorization, differences in various deep learning methods for localization lie in the input features used, the network topology, and whether one or more sources are localized.…”

Section: Applications

mentioning

confidence: 99%
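The two formulations named in the statement above differ mainly in how the training target is encoded. The following is a minimal sketch of that difference for azimuth only. The 10° grid step, the function names, and the choice of unit-circle coordinates as the regression target are illustrative assumptions, not details from the cited works.

```python
import math


def grid_classification_target(source_azimuths_deg, grid_step_deg=10):
    """Multi-label target: one 0/1 entry per direction on a fixed grid.

    Several entries may be 1 at once, so the same encoding covers
    zero, one, or multiple active sources.
    """
    n_bins = 360 // grid_step_deg
    target = [0] * n_bins
    for azimuth in source_azimuths_deg:
        # Snap the source to the nearest grid direction, wrapping at 360.
        index = round((azimuth % 360) / grid_step_deg) % n_bins
        target[index] = 1
    return target


def regression_target(azimuth_deg):
    """Continuous target: coordinates of the direction on the unit circle."""
    rad = math.radians(azimuth_deg)
    return (math.cos(rad), math.sin(rad))


if __name__ == "__main__":
    labels = grid_classification_target([42.0, 357.0])
    print("active grid bins:", [i for i, v in enumerate(labels) if v])
    print("regression target:", regression_target(42.0))
```

The grid target fixes the angular resolution in advance, while the regression target is continuous but, in this simple form, describes a single source per output.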

Deep Learning for Audio Signal Processing

Purwins

Virtanen

et al. 2019

IEEE J. Sel. Top. Signal Process.

572

247


Given the recent surge in developments of deep learning, this article provides a review of the state-of-the-art deep learning techniques for audio signal processing. Speech, music, and environmental sound processing are considered side-by-side, in order to point out similarities and differences between the domains, highlighting general methods, problems, key references, and potential for cross-fertilization between areas. The dominant feature representations (in particular, log-mel spectra and raw waveform) and deep learning models are reviewed, including convolutional neural networks, variants of the long short-term memory architecture, as well as more audio-specific neural network models. Subsequently, prominent deep learning application areas are covered, i.e. audio recognition (automatic speech recognition, music information retrieval, environmental sound detection, localization and tracking) and synthesis and transformation (source separation, audio enhancement, generative models for speech, sound, and music synthesis). Finally, key issues and future questions regarding deep learning applied to audio signal processing are identified.


“…To improve the robustness of DOA estimation, deep neural networks (DNNs) have been proposed to learn a mapping between signal features and a discretized DOA space [17][18][19][20][21]. Various features such as phasemaps [17,18] and GCC-PHAT [21] have been used as inputs.…”

Section: Introduction

mentioning

confidence: 99%

“…Various features such as phasemaps [17,18] and GCC-PHAT [21] have been used as inputs. In [22], the cosines and sines of the frequency-wise phase differences between microphones, termed as cosine-sine interchannel phase difference (CSIPD) features, have been shown to perform as well as phasemaps for DOA estimation, despite their lower dimensionality.…”

Section: Introduction

mentioning

confidence: 99%
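The statement above defines CSIPD features as the cosines and sines of the frequency-wise phase differences between microphones. A minimal sketch for a single microphone pair and a single STFT frame is given below. The function name, the input format (one complex coefficient per frequency bin), and the ordering of the output vector are assumptions for illustration; the cited paper may arrange the features differently.

```python
import cmath
import math


def csipd_features(frame_mic1, frame_mic2):
    """Cosine-sine interchannel phase difference for one STFT frame.

    frame_mic1, frame_mic2: sequences of complex STFT coefficients,
    one per frequency bin, for the two microphones of a pair.
    Returns all cosines followed by all sines (assumed layout).
    """
    cos_part = []
    sin_part = []
    for x1, x2 in zip(frame_mic1, frame_mic2):
        # Phase of x1 * conj(x2) equals phase(x1) - phase(x2), wrapped to (-pi, pi].
        phase_difference = cmath.phase(x1 * x2.conjugate())
        cos_part.append(math.cos(phase_difference))
        sin_part.append(math.sin(phase_difference))
    return cos_part + sin_part


if __name__ == "__main__":
    # Two frequency bins with hypothetical magnitudes and phases.
    mic1 = [cmath.rect(1.0, 0.5), cmath.rect(2.0, 1.0)]
    mic2 = [cmath.rect(1.0, 0.2), cmath.rect(3.0, 1.0)]
    print(csipd_features(mic1, mic2))
```

Because only the phase difference enters the features, the magnitudes of the coefficients do not affect the result, and taking the cosine and sine removes the 2π ambiguity of the raw phase.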

Keyword Based Speaker Localization: Localizing a Target Speaker in a Multi-speaker Environment

Sivasankaran,

Fohr

2018

Interspeech 2018


To cite this version: Sunit Sivasankaran, Emmanuel Vincent, Dominique Fohr. Keyword-based speaker localization: Localizing a target speaker in a multi-speaker environment. Interspeech 2018 - 19th

Abstract: Speaker localization is a hard task, especially in adverse environmental conditions involving reverberation and noise. In this work we introduce the new task of localizing the speaker who uttered a given keyword, e.g., the wake-up word of a distant-microphone voice command system, in the presence of overlapping speech. We employ a convolutional neural network based localization system and investigate multiple identifiers as additional inputs to the system in order to characterize this speaker. We conduct experiments using ground truth identifiers which are obtained assuming the availability of clean speech and also in realistic conditions where the identifiers are computed from the corrupted speech. We find that the identifier consisting of the ground truth time-frequency mask corresponding to the target speaker provides the best localization performance and we propose methods to estimate such a mask in adverse reverberant and noisy conditions using the considered keyword.
