Speech Enhancement Using Self-Adaptation and Multi-Head Self-Attention

Koizumi, Yohei; Yatabe, Kohei; Delcroix, Marc; Masuyama, Yoshiaki; Takeuchi, Daisuke

doi:10.1109/icassp40776.2020.9053214

Cited by 106 publications

(47 citation statements)

References 30 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Several prior works for speaker extraction have studied various cues about the target speaker, such as voiceprint [11,20,21], lip movement [12,22], facial appearance [23], and spatial information [13].…”

Section: Relation To Prior Workmentioning

confidence: 99%

Wase: Learning When to Attend for Speaker Extraction in Cocktail Party Environments

Hao

Zhang

et al. 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

View full text Add to dashboard Cite

In the speaker extraction problem, it is found that additional information from the target speaker contributes to the tracking and extraction of the target speaker, which includes voiceprint, lip movement, facial expression, and spatial information. However, no one cares for the cue of sound onset, which has been emphasized in the auditory scene analysis and psychology. Inspired by it, we explicitly modeled the onset cue and verified the effectiveness in the speaker extraction task. We further extended to the onset/offset cues and got performance improvement. From the perspective of tasks, our onset/offset-based model completes the composite task, a complementary combination of speaker extraction and speaker-dependent voice activity detection. We also combined voiceprint with onset/offset cues. Voiceprint models voice characteristics of the target while onset/offset models the start/end information of the speech. From the perspective of auditory scene analysis, the combination of two perception cues can promote the integrity of the auditory object. The experiment results are also close to state-of-the-art performance, using nearly half of the parameters. We hope that this work will inspire communities of speech processing and psychology, and contribute to communication between them. Our code will be available in https: //github.com/aispeech-lab/wase/.

show abstract

Section: Relation To Prior Workmentioning

confidence: 99%

Wase: Learning When to Attend for Speaker Extraction in Cocktail Party Environments

Hao

Zhang

et al. 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

View full text Add to dashboard Cite

show abstract

“…The related works are described as follows. The attention blocks named self-attention were applied for speech enhancement [12,17]. The self-attention focused on the response of positions in a sequence while CCBAM focuses on cross-channel and spatial information of feature maps.…”

Section: Introductionmentioning

confidence: 99%

Monaural Speech Enhancement with Complex Convolutional Block Attention Module and Joint Time Frequency Losses

Zhao

Nguyen

2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

View full text Add to dashboard Cite

Deep complex U-Net structure and convolutional recurrent network (CRN) structure achieve state-of-the-art performance for monaural speech enhancement. Both deep complex U-Net and CRN are encoder and decoder structures with skip connections, which heavily rely on the representation power of the complex-valued convolutional layers. In this paper, we propose a complex convolutional block attention module (CCBAM) to boost the representation power of the complexvalued convolutional layers by constructing more informative features. The CCBAM is a lightweight and general module which can be easily integrated into any complex-valued convolutional layers. We integrate CCBAM with the deep complex U-Net and CRN to enhance their performance for speech enhancement. We further propose a mixed loss function to jointly optimize the complex models in both time-frequency (TF) domain and time domain. By integrating CCBAM and the mixed loss, we form a new end-to-end (E2E) complex speech enhancement framework. Ablation experiments and objective evaluations show the superior performance of the proposed approaches.

show abstract

“…However, the corresponding noise segment in the same environment need to be prepared to create the noise embedding through an embedding subnetwork and the process of speech enhancement during inference is not convenient. Some research related to speaker-aware [13,14] or signal-to-noise-ratio (SNR) aware [15] algorithms have also been proposed to improve the speech enhancement model's denoising performance.…”

Section: Introductionmentioning

confidence: 99%

Neural Noise Embedding for End-To-End Speech Enhancement with Conditional Layer Normalization

Zhang

et al. 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

View full text Add to dashboard Cite

Most of the deep learning based speech enhancement methods focus on the modeling of complicated relationship between the noisy speech and the clean speech without the consideration of noise information. In order to cope with various complex noise scenes, we introduce a novel enhancement architecture that integrates a deep autoencoder with neural noise embedding. In this study, a new normalization method, termed conditional layer normalization (CLN), is introduced to improve the generalization of deep learning based speech enhancement approaches for unseen environments. The noise embedding is passed through the CLN layers to regularize the network for speech enhancement task. The proposed network can be adaptively adjusted according to different noise information extracted from the noisy speech input. The network in overall is trained in an end-to-end manner and the experimental results show that the proposed scheme produces satisfactory enhancement performance comparing the other methods. The visualization shows that our proposed network captures noise information, which is helpful to improve robustness to unseen environments for speech enhancement.

show abstract

Speech Enhancement Using Self-Adaptation and Multi-Head Self-Attention

Cited by 106 publications

References 30 publications

Wase: Learning When to Attend for Speaker Extraction in Cocktail Party Environments

Wase: Learning When to Attend for Speaker Extraction in Cocktail Party Environments

Monaural Speech Enhancement with Complex Convolutional Block Attention Module and Joint Time Frequency Losses

Neural Noise Embedding for End-To-End Speech Enhancement with Conditional Layer Normalization

Contact Info

Product

Resources

About