2020
DOI: 10.48550/arxiv.2010.06007
Preprint

The Cone of Silence: Speech Separation by Localization

Teerapat Jenrungrot, Vivek Jayaram, Steve Seitz, et al.

Abstract: Given a multi-microphone recording of an unknown number of speakers talking concurrently, we simultaneously localize the sources and separate the individual speakers. At the core of our method is a deep network, in the waveform domain, which isolates sources within an angular region θ ± w/2, given an angle of interest θ and angular window size w. By exponentially decreasing w, we can perform a binary search to localize and separate all sources in logarithmic time. Our algorithm allows for an arbitrary number o…
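
The coarse-to-fine search described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the deep network is replaced by a hypothetical predicate `region_is_active(theta, w)` that only reports whether some source lies inside θ ± w/2, and the 90° starting window, the 2° stopping threshold, and all function names are assumptions chosen for illustration.

```python
def angular_distance(a, b):
    """Smallest absolute difference between two angles, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def make_toy_detector(true_angles):
    """Hypothetical stand-in for the separation network.

    The returned predicate answers: does any source lie in theta +/- w/2?
    A real system would instead run the network and inspect its output.
    """
    def region_is_active(theta, w):
        return any(angular_distance(theta, s) <= w / 2.0 for s in true_angles)
    return region_is_active


def localize(region_is_active, initial_window=90.0, min_window=2.0):
    """Halve the angular window each round, keeping only active regions.

    initial_window and min_window are illustrative assumptions.
    Returns the surviving region centers and the final window size.
    """
    w = initial_window
    n_regions = int(round(360.0 / w))
    # Tile the full circle with non-overlapping windows of size w.
    candidates = [(i + 0.5) * w for i in range(n_regions)]
    candidates = [t for t in candidates if region_is_active(t, w)]

    while w > min_window:
        w /= 2.0
        survivors = []
        for theta in candidates:
            # Split the parent region into two halves of the new size w.
            for child in ((theta - w / 2.0) % 360.0, (theta + w / 2.0) % 360.0):
                if region_is_active(child, w):
                    survivors.append(child)
        candidates = survivors
    return sorted(candidates), w


# Usage: two simulated speakers at 30 and 200 degrees.
detector = make_toy_detector([30.0, 200.0])
found, final_w = localize(detector)
```

Because empty regions are discarded at every round, the number of predicate calls in this sketch grows with the number of halvings (logarithmic in the angular resolution) times the number of active regions, rather than with the number of fine-grained angles on the circle.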

Cited by 5 publications (12 citation statements)
References 54 publications (72 reference statements)

“…[53], [54], [55]. Speaker localization, diarization and (speech) source separation are intrinsically connected problems, as the information retrieved from solving each one of them can be useful for addressing the others [23], [56], [57]. An investigation of those connections is out of the scope of the present survey.…”
Section: Number Of Sources
mentioning
confidence: 99%
“…This spectral mask is finally applied for source separation. Another joint localization and separation system based on a U-Net architecture is proposed in [57]. In this system, they train a U-Net based on 1D convolutional layers and GLUs.…”
Section: G Encoder-decoder Neural Network
mentioning
confidence: 99%
“…While these signal processing techniques can be computationally light-weight, they have a limited performance (Souden, Benesty, and Affes 2010; Kumatani et al 2012). Recent work has shown that neural networks achieve exceptional source separation in comparison Jenrungrot et al 2020) but are computationally expensive and to date, cannot run on-device on wearable computing platforms.…”
Section: Introduction
mentioning
confidence: 99%
“…Time-domain approaches such as Demucs (Défossez, Synnaeve, and Adi 2020), TasNet (Luo and Mesgarani 2018) FasNet (Luo et al 2019), TAC (Luo et al 2020), Conv-TasNet (Luo and Mesgarani 2019) and its variants (Gu et al 2019b; Défossez et al 2019; Luo et al 2020; Han, Luo, and Mesgarani 2020) optimize for the learnt filters that convolve with the mixture signals to separate each sound source. While time-domain approaches allow causal construction and more effective separation, they are not designed to match directions with each separated signal from the mixture, and the computation grows exponentially with the number of sources (Jenrungrot et al 2020…”
mentioning
confidence: 99%