Sudo RM -RF: Efficient Networks for Universal Audio Source Separation

Tzinis, Efthymios; Wang, Zhepei; Smaragdis, Paris

doi:10.1109/mlsp49062.2020.9231900

Cited by 77 publications

(61 citation statements)

References 16 publications


“…The proposed model is based on the learned-domain masking approach [14, 15, 17-22] and employs an encoder, a decoder, and a masking network, as shown in Figure 1. The encoder is fully convolutional, while the masking network employs two Transformers embedded inside the dual-path processing block proposed in [17].…”

Section: The Model (mentioning)

confidence: 99%

Attention Is All You Need In Speech Separation

Subakan

Ravanelli

Cornell

et al. 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)



Recurrent Neural Networks (RNNs) have long been the dominant architecture in sequence-to-sequence learning. RNNs, however, are inherently sequential models that do not allow parallelization of their computations. Transformers are emerging as a natural alternative to standard RNNs, replacing recurrent computations with a multi-head attention mechanism. In this paper, we propose the SepFormer, a novel RNN-free Transformer-based neural network for speech separation. The SepFormer learns short and long-term dependencies with a multi-scale approach that employs transformers. The proposed model achieves state-of-the-art (SOTA) performance on the standard WSJ0-2/3mix datasets. It reaches an SI-SNRi of 22.3 dB on WSJ0-2mix and an SI-SNRi of 19.5 dB on WSJ0-3mix. The SepFormer inherits the parallelization advantages of Transformers and achieves a competitive performance even when downsampling the encoded representation by a factor of 8. It is thus significantly faster and it is less memory-demanding than the latest speech separation systems with comparable performance.
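The SI-SNRi figures in the abstract above are improvements in scale-invariant signal-to-noise ratio over the unprocessed mixture. The following is a minimal sketch of that metric using its commonly used definition (zero-mean signals, projection of the estimate onto the reference); it is not taken from the SepFormer code, and the function names and the synthetic sine signals are assumptions made for illustration only.

```python
import math


def _zero_mean(x):
    """Remove the mean so the metric ignores any DC offset."""
    m = sum(x) / len(x)
    return [v - m for v in x]


def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant SNR in dB between an estimate and a reference signal."""
    est = _zero_mean(estimate)
    ref = _zero_mean(reference)
    # Project the estimate onto the reference to get the target component.
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref) + eps
    scale = dot / ref_energy
    target = [scale * r for r in ref]
    # Whatever is left after removing the target component counts as noise.
    noise = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(n * n for n in noise) + eps
    return 10.0 * math.log10(target_energy / noise_energy + eps)


def si_snr_improvement(estimate, mixture, reference):
    """SI-SNRi: SI-SNR of the estimate minus SI-SNR of the raw mixture."""
    return si_snr(estimate, reference) - si_snr(mixture, reference)


# Synthetic example (assumed data): two orthogonal sines of equal energy.
N = 1000
reference = [math.sin(2 * math.pi * 5 * n / N) for n in range(N)]
interference = [math.sin(2 * math.pi * 13 * n / N) for n in range(N)]
mixture = [r + i for r, i in zip(reference, interference)]
# A hypothetical separator that attenuates the interference by a factor of 10.
estimate = [r + 0.1 * i for r, i in zip(reference, interference)]

print(f"SI-SNRi: {si_snr_improvement(estimate, mixture, reference):.1f} dB")
```

Because the estimate is projected onto the reference before the ratio is taken, rescaling the estimate leaves the score unchanged, which is why the metric is called scale-invariant.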



“…Although the selected Sudo rm -rf configuration has a large enough receptive field to cover the entire sequential feature, it obtains an even worse separation performance with on-par model size and complexity as the TCN architecture. Although [15] reported that the Sudo rm -rf architecture achieved constantly better performance than DPRNN and TCN architectures, the results here indicates that its performance on the more challenging noisy reverberant environments needs to be revised. Moreover, although all four architectures achieve significant SI-SDR improvement with respect to the unprocessed mixture, the improvement on wideband PESQ and STOI scores are moderate.…”

Section: Effect of GC3 in Different Separation Modules (mentioning)

confidence: 64%

“…Since the context codec squeezes the long sequence by a factor of C/2 (16 for C = 32), the effective temporal receptive field of the TCN separator is significantly larger (0.253 × 16 = 4.05s) and thus can better capture the temporal dependencies. Since it has also been reported in [15] that a deeper Sudo rm -rf architecture can lead to better overall separation performance, introducing GC3 to Sudo rm -rf might also be equivalent to increasing the model depth and improves the performance. More in-depth analysis on the reason behind the performance improvements in different architectures is left for future work.…”

Section: Effect of GC3 in Different Separation Modules (mentioning)

confidence: 99%
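The receptive-field figure in the statement above follows from simple arithmetic: a context size C gives a squeeze factor of C/2, and the TCN's 0.253 s receptive field is multiplied by that factor. A small sketch of the calculation is given here; the function names are hypothetical and only the numbers 0.253, 32, and 16 come from the quoted text.

```python
def squeeze_factor(context_size: int) -> int:
    """Downsampling factor of the context codec as quoted: C / 2."""
    if context_size <= 0 or context_size % 2 != 0:
        raise ValueError("context_size must be a positive even integer")
    return context_size // 2


def effective_receptive_field(base_seconds: float, context_size: int) -> float:
    """Receptive field in seconds after the sequence is squeezed by C / 2."""
    return base_seconds * squeeze_factor(context_size)


# Values from the quoted statement: 0.253 s base receptive field, C = 32.
rf = effective_receptive_field(0.253, 32)
print(f"squeeze factor: {squeeze_factor(32)}")
print(f"effective receptive field: {rf:.2f} s")
```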

“…Squeezing the input contexts into higher-level representations corresponds to a nonlinear downsampling step that generates context-level embeddings and significantly decreases the length of a feature sequence. Note that compared with other architectures that perform iterative downsampling and upsampling steps [9], [15], the context codec is only applied once and all remaining modeling steps are applied on the downsampled features, which enables a smaller memory footprint and faster training speed. We call the combination of GroupComm and Context Codec the GC3 design.…”

Section: Introduction (mentioning)

confidence: 99%

“…In this paper, we also explore different architectures for the GroupComm module and investigate the effect of different hyperparameters in the system configuration. Moreover, to validate the effect of GC3 on different network architectures, we select three other separation modules beyond the original dual-path RNN (DPRNN) baseline [22] applied in our previous work on GroupComm: two other CNN-based architectures, namely the temporal convolutional network (TCN) [11] and the sudo rm -rf network [15], and one transformer-based architecture, namely the dual-path transformer network (DPTNet) [24]. Experimental results show that the GC3-equipped DPRNN can achieve on-par performance with the baseline DPRNN with 4.7% model size and 17.6% MAC operations, the GC3-equipped CNN-based models can significantly improve the overall performance with as few as 2.5% model size and 33.7% MAC operations, and the GC3-equipped transformer-based model can maintain on-par performance with 4.6% model size and 17.7% MAC operations.…”

Section: Introduction (mentioning)

confidence: 99%


Group Communication With Context Codec for Lightweight Source Separation

Luo

Han

Mesgarani

2021

IEEE/ACM Trans. Audio Speech Lang. Process.


Despite the recent progress on neural network architectures for speech separation, the balance between the model size, model complexity and model performance is still an important and challenging problem for the deployment of such models to low-resource platforms. In this paper, we propose two simple modules, group communication and context codec, that can be easily applied to a wide range of architectures to jointly decrease the model size and complexity without sacrificing the performance. A group communication module splits a high-dimensional feature into groups of low-dimensional features and captures the inter-group dependency. A separation module with a significantly smaller model size can then be shared by all the groups. A context codec module, containing a context encoder and a context decoder, is designed as a learnable downsampling and upsampling module to decrease the length of a sequential feature processed by the separation module. The combination of the group communication and the context codec modules is referred to as the GC3 design. Experimental results show that applying GC3 on multiple network architectures for speech separation can achieve on-par or better performance with as small as 2.5% model size and 17.6% model complexity, respectively.
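To illustrate why sharing one small module across feature groups shrinks a model, the sketch below splits a feature vector into equal groups and compares the weight count of a square dense layer at full width against one at group width. This is not the GroupComm implementation: the inter-group communication step is omitted, and the feature width of 128, the 16 groups, and all function names are assumptions chosen for the example.

```python
def split_into_groups(feature, num_groups):
    """Split a feature vector into num_groups equal, contiguous groups."""
    if num_groups <= 0 or len(feature) % num_groups != 0:
        raise ValueError("feature length must be divisible by num_groups")
    size = len(feature) // num_groups
    return [feature[i * size:(i + 1) * size] for i in range(num_groups)]


def apply_shared_module(groups, module):
    """Apply the same (shared) module to every group."""
    return [module(group) for group in groups]


def dense_weight_count(width):
    """Weights in a square dense layer (width x width), ignoring biases."""
    return width * width


# Toy feature of width 8 split into 4 groups of width 2.
groups = split_into_groups(list(range(8)), 4)
doubled = apply_shared_module(groups, lambda g: [2 * v for v in g])
print(groups)
print(doubled)

# Assumed sizes: a 128-wide feature processed whole versus in 16 shared groups.
full = dense_weight_count(128)
shared = dense_weight_count(128 // 16)
print(f"full layer: {full} weights, shared group layer: {shared} weights")
```

For a square dense layer the weight count falls by the square of the number of groups, since the single group-width layer is reused for every group rather than duplicated.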


Cascade Multiscale Swin-Conv Network for Fast MRI Reconstruction

Xie

Xiong

et al. 2022

Pattern Recognition and Computer Vision

