The Cone of Silence: Speech Separation by Localization

Jenrungrot, Teerapat; Jayaram, Vivek; Seitz, Steve; Kemelmacher-Shlizerman, Ira

doi:10.48550/arxiv.2010.06007

Cited by 5 publications

(12 citation statements)

References 54 publications

(72 reference statements)


“…[53], [54], [55]. Speaker localization, diarization and (speech) source separation are intrinsically connected problems, as the information retrieved from solving each one of them can be useful for addressing the others [23], [56], [57]. An investigation of those connections is out of the scope of the present survey.…”

Section: Number Of Sources (mentioning)

confidence: 99%

“…This spectral mask is finally applied for source separation. Another joint localization and separation system based on a U-Net architecture is proposed in [57]. In this system, they train a U-Net based on 1D convolutional layers and GLUs.…”

Section: G Encoder-decoder Neural Network (mentioning)

confidence: 99%

“…In [191], the multichannel waveforms are fed into an autoencoder. In [57], the waveforms of each channel are shifted to be temporally aligned according to the TDoA before being injected into the input layer. In the same vein, Huang et al [213], [214] proposed to time-shift the multichannel signal by calculating the time delay between the microphone position and the candidate source location, which requires to scan for all candidate locations.…”

Section: F Waveforms (mentioning)

confidence: 99%
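
The TDoA-based alignment described in the citation statement above can be illustrated with a minimal sketch. This is not the cited paper's implementation: the function names (`tdoa_samples`, `shift`, `align_channels`), the 2D geometry, the whole-sample rounding, and the speed-of-sound constant are all assumptions made for illustration.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed room-temperature value


def tdoa_samples(mic_xy, source_xy, ref_xy, sample_rate):
    """Delay, in whole samples, of one mic relative to the reference mic."""
    d_mic = math.dist(mic_xy, source_xy)
    d_ref = math.dist(ref_xy, source_xy)
    return round((d_mic - d_ref) / SPEED_OF_SOUND * sample_rate)


def shift(signal, delay):
    """Advance a signal by `delay` samples, zero-padding to keep its length."""
    n = len(signal)
    if delay >= 0:
        return signal[delay:] + [0.0] * min(delay, n)
    return [0.0] * min(-delay, n) + signal[:delay]


def align_channels(channels, mic_positions, candidate_xy, sample_rate):
    """Shift every channel so sound from candidate_xy lines up across mics.

    The first microphone is used as the (hypothetical) reference.
    """
    ref = mic_positions[0]
    return [
        shift(ch, tdoa_samples(pos, candidate_xy, ref, sample_rate))
        for ch, pos in zip(channels, mic_positions)
    ]
```

After alignment, a source at the candidate location appears at the same sample index in every channel, while sources elsewhere remain misaligned. Scanning several candidate locations, as the statement notes for Huang et al., would repeat this shift once per candidate.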


A Survey of Sound Source Localization with Deep Learning Methods

Grumiaux, Kitić, Girin et al. 2021

Preprint


This article is a survey on deep learning methods for single and multiple sound source localization. We are particularly interested in sound source localization in indoor/domestic environment, where reverberation and diffuse noise are present. We provide an exhaustive topography of the neural-based localization literature in this context, organized according to several aspects: the neural network architecture, the type of input features, the output strategy (classification or regression), the types of data used for model training and evaluation, and the model training strategy. This way, an interested reader can easily comprehend the vast panorama of the deep learning-based sound source localization methods. Tables summarizing the literature survey are provided at the end of the paper for a quick search of methods with a given set of target characteristics.


“…While these signal processing techniques can be computationally light-weight, they have a limited performance (Souden, Benesty, and Affes 2010; Kumatani et al 2012). Recent work has shown that neural networks achieve exceptional source separation in comparison (Jenrungrot et al 2020) but are computationally expensive and to date, cannot run on-device on wearable computing platforms.…”

Section: Introduction (mentioning)

confidence: 99%

“…Time-domain approaches such as Demucs (Défossez, Synnaeve, and Adi 2020), TasNet (Luo and Mesgarani 2018), FasNet (Luo et al 2019), TAC (Luo et al 2020), Conv-TasNet (Luo and Mesgarani 2019) and its variants (Gu et al 2019b; Défossez et al 2019; Luo et al 2020; Han, Luo, and Mesgarani 2020) optimize for the learnt filters that convolve with the mixture signals to separate each sound source. While time-domain approaches allow causal construction and more effective separation, they are not designed to match directions with each separated signal from the mixture, and the computation grows exponentially with the number of sources (Jenrungrot et al 2020…”

mentioning

confidence: 99%

Hybrid Neural Networks for On-device Directional Hearing

Wang, Kim, Zhang

et al. 2021

Preprint


On-device directional hearing requires audio source separation from a given direction while achieving stringent human-imperceptible latency requirements. While neural nets can achieve significantly better performance than traditional beamformers, all existing models fall short of supporting low-latency causal inference on computationally-constrained wearables. We present DeepBeam, a hybrid model that combines traditional beamformers with a custom lightweight neural net. The former reduces the computational burden of the latter and also improves its generalizability, while the latter is designed to further reduce the memory and computational overhead to enable real-time and low-latency operations. Our evaluation shows comparable performance to state-of-the-art causal inference models on synthetic data while achieving a 5x reduction of model size, 4x reduction of computation per second, 5x reduction in processing time and generalizing better to real hardware data. Further, our real-time hybrid model runs in 8 ms on mobile CPUs designed for low-power wearable devices and achieves an end-to-end latency of 17.5 ms.


Hybrid Neural Networks for On-Device Directional Hearing

Wang, Kim, Zhang et al. 2022

AAAI


On-device directional hearing requires audio source separation from a given direction while achieving stringent human-imperceptible latency requirements. While neural nets can achieve significantly better performance than traditional beamformers, all existing models fall short of supporting low-latency causal inference on computationally-constrained wearables. We present DeepBeam, a hybrid model that combines traditional beamformers with a custom lightweight neural net. The former reduces the computational burden of the latter and also improves its generalizability, while the latter is designed to further reduce the memory and computational overhead to enable real-time and low-latency operations. Our evaluation shows comparable performance to state-of-the-art causal inference models on synthetic data while achieving a 5x reduction of model size, 4x reduction of computation per second, 5x reduction in processing time and generalizing better to real hardware data. Further, our real-time hybrid model runs in 8 ms on mobile CPUs designed for low-power wearable devices and achieves an end-to-end latency of 17.5 ms.
