ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2022
DOI: 10.1109/icassp43922.2022.9747120
MANNER: Multi-View Attention Network For Noise Erasure

Cited by 24 publications (7 citation statements)
References: 16 publications
“…Specifically, we use the model and training setup defined in [50], trained on the SC09 dataset. For speech enhancement, we compare to MANNER [51], a recent high-performing speech enhancement model operating in the time-domain. Since we have no paired clean/noisy utterances for the SC09 dataset, we follow the technique from [52] to construct a speech enhancement dataset.…”
Section: B. Baseline Systems (mentioning)
confidence: 99%
“…Unseen generative task performance of ASGAN compared to task-specific systems (AutoVC [50] for voice conversion, MANNER [51] for speech enhancement).…”
(mentioning)
confidence: 99%
“…MANNER [38] is an end-to-end multi-view attention network that currently ranks 6th in terms of PESQ on the Voice-bank+DEMAND dataset [79]. It presents a U-Net [72]-based architecture, whose blocks combine channel attention [80] with local and global attention along two signal scales, similar to dual-path models [81].…”
Section: MANNER (mentioning)
confidence: 99%
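The excerpt above describes blocks that combine channel attention with local and global attention. As a rough illustration only (not the authors' implementation), a squeeze-and-excitation-style channel attention gate over a 1-D signal can be sketched in NumPy; the weight shapes, the reduction ratio `r`, and the function name `channel_attention` are assumptions for this sketch:

```python
import numpy as np

def channel_attention(x, w1, w2):
    """Squeeze-and-excitation-style channel attention for a 1-D signal.
    x: (channels, time); w1: (channels//r, channels); w2: (channels, channels//r).
    Hypothetical sketch; MANNER's actual block differs in detail."""
    s = x.mean(axis=1)                   # squeeze: global average pool over time
    h = np.maximum(w1 @ s, 0.0)          # excitation: bottleneck projection + ReLU
    a = 1.0 / (1.0 + np.exp(-(w2 @ h)))  # per-channel sigmoid gate in (0, 1)
    return x * a[:, None]                # rescale each channel by its gate

C, T, r = 8, 100, 2
rng = np.random.default_rng(0)
x = rng.standard_normal((C, T))
w1 = rng.standard_normal((C // r, C))
w2 = rng.standard_normal((C, C // r))
y = channel_attention(x, w1, w2)  # same shape as x, channels re-weighted
```

Because the gate is a sigmoid, each channel is attenuated rather than amplified; the network learns which channels to emphasize relative to the others.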
“…The generalization gap is then averaged across folds for a more accurate estimation. We use this framework to evaluate the influence of the speech, noise and room dimensions on the generalization performance of four speech enhancement systems: an FFNN-based system, Conv-TasNet [36], DCCRN [37] and MANNER [38]. Combined mismatches along multiple dimensions are also investigated.…”
Section: Introduction (mentioning)
confidence: 99%
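The fold-averaging step mentioned in the excerpt above can be sketched minimally; the function name and the definition of the gap (matched-condition score minus mismatched-condition score) are assumptions here, and the cited paper's exact metric may differ:

```python
def mean_generalization_gap(matched_scores, mismatched_scores):
    """Per fold, compute (score under the matched/seen condition) minus
    (score under the mismatched/unseen condition), then average the
    per-fold gaps. Hypothetical sketch of cross-fold averaging."""
    gaps = [m - mm for m, mm in zip(matched_scores, mismatched_scores)]
    return sum(gaps) / len(gaps)

# e.g. a quality score per fold, matched vs. mismatched noise conditions
gap = mean_generalization_gap([2.8, 3.1, 2.9], [2.5, 2.6, 2.7])
```

Averaging over folds reduces the variance of the estimate relative to a single train/test split, which is the motivation the excerpt gives.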
“…Cao et al. proposed a generative adversarial network to model temporal and frequency correlations and achieved extremely high performance [22]. Park et al. proposed a multi-view attention network to improve the accuracy of feature extraction [23].…”
Section: Introduction (mentioning)
confidence: 99%