Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence 2020
DOI: 10.24963/ijcai.2020/528

Joint Time-Frequency and Time Domain Learning for Speech Enhancement

Abstract: For single-channel speech enhancement, both time-domain and time-frequency-domain methods have their respective pros and cons. In this paper, we present a cross-domain framework named TFT-Net, which takes a time-frequency spectrogram as input and produces a time-domain waveform as output. Such a framework takes advantage of the knowledge we have about spectrograms and avoids some of the drawbacks that T-F-domain methods have been suffering from. In TFT-Net, we design an innovative dual-path attention block …
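To make the cross-domain idea concrete, here is a minimal PyTorch sketch of a model with the same input/output contract: STFT spectrogram in, waveform out. All layer shapes and the frame-to-samples linear decoder are illustrative assumptions, not TFT-Net's actual encoder, dual-path attention blocks, or decoder.

```python
import torch
import torch.nn as nn

class CrossDomainSE(nn.Module):
    """Sketch of a spectrogram-in / waveform-out enhancer.

    Hypothetical layer sizes; TFT-Net's real architecture
    (dual-path attention blocks, etc.) is more elaborate.
    """

    def __init__(self, n_fft=512, hop=128):
        super().__init__()
        self.n_fft, self.hop = n_fft, hop
        freq_bins = n_fft // 2 + 1
        # 2-channel input: real and imaginary parts of the STFT.
        self.encoder = nn.Sequential(
            nn.Conv2d(2, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 2, kernel_size=3, padding=1),
        )
        # Assumed decoder: map each enhanced T-F frame to hop waveform samples.
        self.to_wave = nn.Linear(2 * freq_bins, hop)

    def forward(self, noisy_wave):
        window = torch.hann_window(self.n_fft, device=noisy_wave.device)
        spec = torch.stft(noisy_wave, self.n_fft, self.hop,
                          window=window, return_complex=True)
        x = torch.stack([spec.real, spec.imag], dim=1)  # (B, 2, F, T)
        x = self.encoder(x)                             # enhanced T-F features
        b, c, f, t = x.shape
        x = x.permute(0, 3, 1, 2).reshape(b, t, c * f)  # (B, T, 2F)
        return self.to_wave(x).reshape(b, -1)           # waveform (B, T*hop)

wave = torch.randn(1, 16000)          # 1 s of 16 kHz audio
print(CrossDomainSE()(wave).shape)    # enhanced waveform
```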

Cited by 47 publications (22 citation statements: 1 supporting, 21 mentioning, 0 contrasting). References 1 publication.
“…In recent years, deep neural networks have shown great potential for single-channel speech enhancement (SE), or noise suppression [2,3,4,5,6,7,8]. Although these models substantially remove background noise, most of them significantly degrade downstream tasks such as automatic speech recognition (ASR), because modern commercial multi-condition-trained ASR systems can usually recognize the original noisy speech well [5], while SE models introduce unseen distortions that are particularly harmful to ASR.…”
Section: Introduction (mentioning)
confidence: 99%
“…1) Adaptive time-frequency attention Transformer: To alleviate the heavy computational complexity of conventional self-attention, we introduce an adaptive time-frequency attention (ATFA) mechanism as a lightweight solution to capture the long-range correlations exhibited along the temporal and spectral axes, as described in [22], [25]. As illustrated in Fig.…”
Section: A Densely Convolutional Encoder (mentioning)
confidence: 99%
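The core of the quoted ATFA idea, attention restricted to one axis at a time, can be sketched as follows. Attending along each axis separately costs O(T²·F + F²·T) instead of O((T·F)²) for full attention over all T-F bins. The channel width, head count, and plain nn.MultiheadAttention layers are assumptions for illustration; the adaptive weighting of the cited design is omitted.

```python
import torch
import torch.nn as nn

class TimeFreqAttention(nn.Module):
    """Sketch of axis-wise self-attention over a (B, C, F, T) feature map."""

    def __init__(self, channels=32, heads=4):
        super().__init__()
        self.time_attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.freq_attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, x):                        # x: (B, C, F, T)
        b, c, f, t = x.shape
        # Attend along the time axis: one sequence per frequency bin.
        xt = x.permute(0, 2, 3, 1).reshape(b * f, t, c)
        xt = self.time_attn(xt, xt, xt)[0].reshape(b, f, t, c)
        # Attend along the frequency axis: one sequence per time frame.
        xf = xt.permute(0, 2, 1, 3).reshape(b * t, f, c)
        xf = self.freq_attn(xf, xf, xf)[0].reshape(b, t, f, c)
        return xf.permute(0, 3, 2, 1)            # back to (B, C, F, T)

x = torch.randn(2, 32, 257, 100)
print(TimeFreqAttention()(x).shape)              # torch.Size([2, 32, 257, 100])
```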
“…In each branch, different from the vanilla Transformer, a GRU-based improved Transformer [7] is employed, which is comprised of multi-head self-attention (MHSA) and a GRU-based position-wise network, followed by residual connections and LN. Multi-head self-attention has been widely used in natural language processing and speech, as it can leverage the contextual information in the feature maps [21], [22], [25], [26]. In MHSA modules, the input features are first linearly mapped with h different linear projections to get query (Q), key (K), and value (V) representations.…”
Section: A Densely Convolutional Encoder (mentioning)
confidence: 99%
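The MHSA step the quote describes, h parallel projections into Q, K, and V followed by scaled dot-product attention per head, can be written out directly. The d_model and h values below are arbitrary illustrative choices, and the cited GRU-based position-wise network is left out.

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Sketch of MHSA: project the input into per-head Q, K, V,
    then apply scaled dot-product attention in each head."""

    def __init__(self, d_model=64, h=4):
        super().__init__()
        assert d_model % h == 0
        self.h, self.d_k = h, d_model // h
        # One fused projection each for Q, K, V (equivalent to h separate ones).
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):                                # x: (B, T, d_model)
        b, t, _ = x.shape
        split = lambda y: y.reshape(b, t, self.h, self.d_k).transpose(1, 2)
        q, k, v = split(self.q_proj(x)), split(self.k_proj(x)), split(self.v_proj(x))
        scores = q @ k.transpose(-2, -1) / self.d_k ** 0.5   # (B, h, T, T)
        ctx = torch.softmax(scores, dim=-1) @ v              # (B, h, T, d_k)
        ctx = ctx.transpose(1, 2).reshape(b, t, self.h * self.d_k)
        return self.out(ctx)

x = torch.randn(2, 50, 64)
print(MultiHeadSelfAttention()(x).shape)             # torch.Size([2, 50, 64])
```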
“…Recently, the progress of deep learning algorithms has also brought substantial improvements in the SE field [14,15,16,17,18,19]. Deep learning techniques are data-driven approaches that frame the SE task as a supervised learning problem, aiming to reconstruct the target speech signal from the noisy mixture.…”
Section: Introduction (mentioning)
confidence: 99%
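As a minimal illustration of that supervised formulation, the sketch below trains a placeholder network to map synthetic noisy frames back to clean targets with an L1 reconstruction loss; the model, data, and loss choice are all assumptions for demonstration, not any cited system.

```python
import torch
import torch.nn as nn

# Placeholder enhancer: noisy frame in, estimated clean frame out.
model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 256))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

clean = torch.randn(8, 256)                 # stand-in target speech frames
noisy = clean + 0.1 * torch.randn(8, 256)   # synthetic noisy mixture

for _ in range(3):                          # a few illustrative steps
    loss = nn.functional.l1_loss(model(noisy), clean)
    opt.zero_grad()
    loss.backward()
    opt.step()
print(float(loss))                          # reconstruction loss decreases
```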