2023
DOI: 10.1109/taslp.2022.3225649
A Time-Frequency Attention Module for Neural Speech Enhancement

Abstract: Speech enhancement plays an essential role in a wide range of speech processing applications. Recent studies on speech enhancement tend to investigate how to effectively capture the long-term contextual dependencies of speech signals to boost performance. However, these studies generally neglect the time-frequency (T-F) distribution information of speech spectral components, which is equally important for speech enhancement. In this paper, we propose a simple yet very effective network module, which we term th…

Cited by 23 publications (9 citation statements)
References 73 publications
“…Through this selective information aggregation mechanism, the speech enhancement network can better preserve the desired speech characteristics and remove uncorrelated noise information more effectively. Currently, there are three popular ways to compute the attention vector in speech enhancement deep networks: channel attention [25], spatial attention [25], and time-frequency (T-F) attention [26]. By using different perspectives to discriminate the importance of different contextual spectral information, each way has its unique advantage in boosting network performance.…”
Section: Triple-Attention-Based TCNN (TA-TCNN)
Mentioning confidence: 99%
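The channel-attention idea mentioned in the passage above can be sketched as a squeeze-and-excitation-style gate. This is a minimal illustrative NumPy version, not the cited papers' exact design: the shapes, the bottleneck ratio `r`, and the pooling choice are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat, w1, w2):
    """Squeeze-and-excitation-style channel attention (illustrative).

    feat: (C, T, F) feature map (channels, time frames, frequency bins).
    w1:   (C, C//r) and w2: (C//r, C) stand in for learnable weights.
    Returns the feature map rescaled by per-channel gates in (0, 1).
    """
    # Squeeze: global average pool over the time-frequency plane -> (C,)
    z = feat.mean(axis=(1, 2))
    # Excitation: bottleneck MLP (ReLU) followed by a sigmoid gate -> (C,)
    gate = sigmoid(np.maximum(z @ w1, 0.0) @ w2)
    # Rescale each channel by its gate
    return feat * gate[:, None, None]

rng = np.random.default_rng(0)
C, T, F, r = 8, 10, 16, 2
feat = rng.standard_normal((C, T, F))
w1 = rng.standard_normal((C, C // r)) * 0.1
w2 = rng.standard_normal((C // r, C)) * 0.1
out = channel_attention(feat, w1, w2)
print(out.shape)  # (8, 10, 16)
```

Because every gate lies strictly in (0, 1), the module can only attenuate channels, never amplify them; the network learns which channels to suppress.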
“…As mentioned earlier, in this paper, we also introduce the T-F attention presented in [26] and exploit it to characterize a salient energy distribution of speech in the time and frequency dimensions. As shown in the right part of Figure 4, the T-F attention block includes two parallel attention paths: time-dimension attention and frequency-dimension attention.…”
Section: Time-Frequency (T-F) Attention
Mentioning confidence: 99%
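The two parallel attention paths described above can be sketched as two 1-D gates whose outer product forms a full T-F mask. This is a hedged NumPy illustration of the idea, assuming simple mean-pooling and single linear layers (`w_t`, `w_f`) in place of the paper's actual attention paths:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tf_attention(spec, w_t, w_f):
    """Two parallel 1-D attention paths over a (T, F) spectrogram.

    Time path:      pool over frequency -> per-frame gate a_t, shape (T,)
    Frequency path: pool over time      -> per-bin gate a_f, shape (F,)
    The outer product a_t a_f^T forms a T-F mask that reweights the
    input, emphasizing salient time-frequency regions.
    w_t (T, T) and w_f (F, F) stand in for the paths' learnable layers.
    """
    a_t = sigmoid(spec.mean(axis=1) @ w_t)   # (T,) time-dimension attention
    a_f = sigmoid(spec.mean(axis=0) @ w_f)   # (F,) frequency-dimension attention
    return spec * np.outer(a_t, a_f)

rng = np.random.default_rng(1)
T, F = 6, 9
spec = np.abs(rng.standard_normal((T, F)))   # magnitude spectrogram
w_t = rng.standard_normal((T, T)) * 0.1
w_f = rng.standard_normal((F, F)) * 0.1
out = tf_attention(spec, w_t, w_f)
print(out.shape)  # (6, 9)
```

The two paths are cheap (each attends along a single dimension), yet their product still yields a distinct weight for every time-frequency bin.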
“…where $W^{Q}_{h}$, $W^{K}_{h}$, and $W^{V}_{h}$ are the learnable projection parameters. The attention mechanism has recently attracted significant interest and has been the subject of several studies [46], [47]. These studies have shown that attention mechanisms can effectively model the distributions of speech signals across the frequency and time dimensions.…”
Section: A Multi-Head Self-Attention Transformer With Time-Frequency ...
Mentioning confidence: 99%
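The projections referred to in the passage above follow the standard self-attention recipe: queries, keys, and values are obtained by multiplying the input with learnable matrices, then combined with scaled dot-product attention. A minimal single-head NumPy sketch (shapes and dimensions are illustrative, not the cited transformer's configuration):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(X, Wq, Wk, Wv):
    """One self-attention head: Q = X Wq, K = X Wk, V = X Wv, then
    softmax(Q K^T / sqrt(d)) V. Wq, Wk, Wv are the learnable
    projection matrices referred to in the quoted passage."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d))   # (T, T) attention weights
    return weights @ V

rng = np.random.default_rng(2)
T, d_model, d_head = 5, 8, 4
X = rng.standard_normal((T, d_model))
Wq, Wk, Wv = (rng.standard_normal((d_model, d_head)) for _ in range(3))
out = attention_head(X, Wq, Wk, Wv)
print(out.shape)  # (5, 4)
```

A multi-head variant simply runs several such heads with separate `Wq`, `Wk`, `Wv` and concatenates their outputs before a final projection.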
“…Inspired by speech enhancement techniques (Zhang et al. 2020, 2022) that restore clear speech from noisy recordings, we aim to mitigate the adverse effects of lip occlusion on audio-visual speech recognition by restoring occluded lips. Considering the partially occluded lip shown in Fig.…”
Section: Introduction
Mentioning confidence: 99%