ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp39728.2021.9413901

Attention Is All You Need In Speech Separation

Abstract: Recurrent Neural Networks (RNNs) have long been the dominant architecture in sequence-to-sequence learning. RNNs, however, are inherently sequential models that do not allow parallelization of their computations. Transformers are emerging as a natural alternative to standard RNNs, replacing recurrent computations with a multi-head attention mechanism. In this paper, we propose the SepFormer, a novel RNN-free Transformer-based neural network for speech separation. The SepFormer learns short and long-term depend…
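The abstract describes replacing recurrence with multi-head attention applied in a dual-path fashion: one Transformer models short-term dependencies within chunks, another models long-term dependencies across chunks. Below is a minimal PyTorch sketch of that dual-path idea; the layer sizes, chunk dimensions, and use of stock nn.TransformerEncoderLayer are illustrative assumptions, not the paper's exact SepFormer configuration.

```python
import torch
import torch.nn as nn

d_model, n_heads = 64, 4  # illustrative sizes, not the paper's values
intra = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
inter = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)

# x: encoded mixture after chunking, (batch, num_chunks, chunk_len, d_model)
x = torch.randn(2, 10, 50, d_model)
b, s, k, d = x.shape

# Intra-chunk Transformer: attention over positions within each chunk,
# capturing short-term dependencies.
x = intra(x.reshape(b * s, k, d)).reshape(b, s, k, d)

# Inter-chunk Transformer: attention over chunks at each within-chunk offset,
# capturing long-term dependencies.
x = x.transpose(1, 2).reshape(b * k, s, d)
x = inter(x).reshape(b, k, s, d).transpose(1, 2)
```

Because each attention call sees either one chunk or one stride of chunks, both passes parallelize over the batch dimension, which is the parallelization advantage over RNNs that the abstract emphasizes.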

Cited by 309 publications (216 citation statements).
References 26 publications.
“…From the results of experiments, we conclude that DF-Conformer is an effective model for SE. Future works include joint-training of SE and ASR using an all Conformer model, and comparison with the dual-path methods [16][17][18] on the SE task.…”
Section: Discussion (mentioning)
confidence: 99%
“…One possible approach is to use the dual-path approach [16][17][18], which is equivalent to using sparse and block-diagonal attention matrices corresponding to the inter- and intra-transformers, respectively. Alternatively, we use FAVOR+ attention introduced in Performer [26], which has linear computational complexity: O(N).…”
Section: Model Structure and Computational Challenges (mentioning)
confidence: 99%
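The statement above claims the dual-path scheme is equivalent to full self-attention restricted by a block-diagonal matrix (intra-transformer) and a complementary sparse matrix (inter-transformer). The short sketch below makes that concrete by constructing both boolean attention patterns; the sequence length and chunk size are illustrative assumptions.

```python
import torch

seq_len, chunk = 12, 4  # illustrative, assumes seq_len divisible by chunk
pos = torch.arange(seq_len)

# Intra-transformer pattern: block-diagonal, a position only attends to
# positions inside its own chunk.
intra_mask = (pos.unsqueeze(0) // chunk) == (pos.unsqueeze(1) // chunk)

# Inter-transformer pattern: sparse, positions that share the same offset
# within their chunk attend to each other across chunks.
inter_mask = (pos.unsqueeze(0) % chunk) == (pos.unsqueeze(1) % chunk)

print(intra_mask.int())  # blocks of ones along the diagonal
print(inter_mask.int())  # ones on a strided grid
```

Printing the two masks shows why the pair is cheaper than dense attention: each row of either mask has only `chunk` or `seq_len // chunk` nonzero entries instead of `seq_len`.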
“…Recent advances in deep learning have enabled the development of neural network architectures capable of separating individual sound sources from mixtures of sounds with high fidelity. Discriminative separation models with supervised training have obtained state-of-the-art performance on multiple tasks such as music separation [1], speech separation [2,3] and speech enhancement [4,5]. However, gathering clean source waveforms to perform supervised training under various domains can be cumbersome or even impossible.…”
Section: Introduction (mentioning)
confidence: 99%
“…The main focus of TasNet is the separator that estimates the masks. A lot of work has since been done to improve the separator, such as fully-convolutional TasNet (Conv-TasNet) [12], dual-path recurrent neural network (DPRNN) [19], gated DualPathRNN [14], dual-path transformer network (DPT-Net) [15], and SepFormer [20]. Among them, the dual-path method is the mainstream, which processes the waveform along two dimensions: the local path and the global path.…”
Section: Introduction (mentioning)
confidence: 99%
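This last statement frames SepFormer as one of several separators in the TasNet pipeline: a learned encoder produces a latent representation of the mixture, the separator estimates one mask per source, and a decoder reconstructs each waveform. A minimal sketch of that pipeline follows, assuming simple convolutional encoder/decoder layers and a placeholder one-layer separator where any of the cited separators (DPRNN, DPT-Net, SepFormer) would actually go; all sizes are illustrative.

```python
import torch
import torch.nn as nn

n_filters, kernel, n_src = 64, 16, 2  # illustrative hyperparameters
encoder = nn.Conv1d(1, n_filters, kernel, stride=kernel // 2)
# Placeholder separator: in practice this is the dual-path network.
separator = nn.Sequential(
    nn.Conv1d(n_filters, n_filters * n_src, 1),
    nn.Sigmoid(),  # masks in [0, 1]
)
decoder = nn.ConvTranspose1d(n_filters, 1, kernel, stride=kernel // 2)

mix = torch.randn(1, 1, 8000)                  # 0.5 s of 16 kHz audio
feats = encoder(mix)                           # (1, n_filters, frames)
masks = separator(feats).chunk(n_src, dim=1)   # one mask per source
sources = [decoder(feats * m) for m in masks]  # estimated source waveforms
```

The "local path" and "global path" mentioned in the quote live entirely inside the separator: the encoder output is chunked, and intra-chunk (local) and inter-chunk (global) processing alternate before the masks are emitted.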