ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9053723

Invertible DNN-Based Nonlinear Time-Frequency Transform for Speech Enhancement

Abstract: We propose an end-to-end speech enhancement method with a trainable time-frequency (T-F) transform based on an invertible deep neural network (DNN). The recent development of speech enhancement has been driven by DNNs. Ordinary DNN-based speech enhancement employs a T-F transform, typically the short-time Fourier transform (STFT), and estimates a T-F mask using a DNN. On the other hand, some methods have considered end-to-end networks which directly estimate the enhanced signals without a T-F transform. While end-to-…

Cited by 9 publications (5 citation statements)
References 30 publications
“…For image processing, an invertible DNN named i-RevNet has been developed [29] and has recently been used as a trainable time-frequency transform in a DNN-based speech enhancement system [30]. i-RevNet alternately performs the squeezing operation and the application of nonlinear functions to only half of each output of the squeezing operation.…”
Section: A. Conventional DS Layers
confidence: 99%
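The coupling structure described in this quote is what makes the transform exactly invertible: since the nonlinearity only ever touches one half of the channels, it can be subtracted back out regardless of what it computes. Below is a minimal sketch of an i-RevNet-style block, assuming an additive coupling with a small 1-D CNN as the nonlinear function; the squeeze factor, the `CouplingBlock` name, and the choice of convolutions are illustrative assumptions, not the cited papers' exact architecture.

```python
import torch
import torch.nn as nn


def squeeze(x, factor=2):
    # Invertible "squeezing": stack `factor` consecutive time samples of a
    # (batch, channels, time) signal along the channel axis. Assumes the
    # time length is divisible by `factor`.
    b, c, t = x.shape
    x = x.reshape(b, c, t // factor, factor)
    return x.permute(0, 1, 3, 2).reshape(b, c * factor, t // factor)


class CouplingBlock(nn.Module):
    """Additive coupling: a nonlinear function f is applied to one half of
    the channels and added to the other half, so the block is exactly
    invertible no matter what f computes."""

    def __init__(self, channels):
        super().__init__()
        half = channels // 2
        self.f = nn.Sequential(
            nn.Conv1d(half, half, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(half, half, kernel_size=3, padding=1),
        )

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=1)
        return torch.cat([x2, x1 + self.f(x2)], dim=1)

    def inverse(self, y):
        # Recovers x exactly: x2 = y1 and x1 = y2 - f(y1).
        y1, y2 = y.chunk(2, dim=1)
        return torch.cat([y2 - self.f(y1), y1], dim=1)
```

Alternating `squeeze` with such coupling blocks yields a trainable, exactly invertible analysis transform in the spirit of the quoted description.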
“…An important requirement in DNN-based speech enhancement and separation is generalization, i.e., working for any speaker. To achieve this in speech enhancement, several studies train a global M using many speech samples spoken by many speakers [3][4][5][6][7][8][9][10][11][12][13][14]. Unfortunately, in speech separation, generalization cannot be achieved solely by using a large-scale training dataset because there is no way of knowing which signal in the speech mixture is the target.…”
Section: Auxiliary Speaker-Aware Feature for Speech Separation
confidence: 99%
“…Generalization is an important requirement in DNN-based speech enhancement so that unknown speakers' speech can be enhanced. To achieve this, several previous studies train a speaker-independent DNN using many speech samples spoken by many speakers [3][4][5][6][7][8][9][10][11][12][13][14]. Meanwhile, in other speech applications, specializing a model to the target speaker has succeeded [15,16].…”
Section: Introduction
confidence: 99%
“…This setup is a standard architecture in DNN-based speech enhancement [6]. The input of the DNN was the log-amplitude spectrogram of the observed signal x, whose size was F × K. The kernel size, stride, and padding of both 2-D CNNs were (5, 15), (1, 1), and (2, 7), respectively. The numbers of output channels of the first and second 2-D CNNs were 30 and 60, respectively.…”
Section: Experimental Setups
confidence: 99%
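For concreteness, the 2-D CNN front-end quoted above can be reconstructed from the stated hyperparameters alone. The PyTorch sketch below is such a reconstruction under stated assumptions: the number of frequency bins F = 257, the frame count K = 100, and the ReLU activations are illustrative choices not specified in the quote.

```python
import torch
import torch.nn as nn

# Two 2-D CNN layers with kernel (5, 15), stride (1, 1), padding (2, 7),
# producing 30 and 60 output channels, as in the quoted setup. The
# activations and the single input channel are assumptions.
enhancer_frontend = nn.Sequential(
    nn.Conv2d(1, 30, kernel_size=(5, 15), stride=(1, 1), padding=(2, 7)),
    nn.ReLU(),
    nn.Conv2d(30, 60, kernel_size=(5, 15), stride=(1, 1), padding=(2, 7)),
    nn.ReLU(),
)

# Input: log-amplitude spectrogram of size F x K (here F = 257 bins and
# K = 100 frames, batch size 1, one input channel). The chosen padding
# preserves both the frequency and time dimensions.
x = torch.randn(1, 1, 257, 100)
print(enhancer_frontend(x).shape)  # torch.Size([1, 60, 257, 100])
```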
“…Over the last decade, the use of deep neural networks (DNNs) for speech enhancement has substantially advanced the state-of-the-art performance [3][4][5][6][7][8][9][10][11][12][13][14][15]. The popular strategy is to estimate a time-frequency (T-F) mask with a DNN and apply it in the short-time Fourier transform (STFT) domain [3], where the enhanced signal is obtained by the inverse STFT.…”
Section: Introduction
confidence: 99%
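The masking strategy described in this quote fits in a few lines. The sketch below uses PyTorch's stft/istft and a placeholder mask_net; it is a generic illustration of STFT-domain T-F masking, not the paper's proposed invertible transform, which replaces the fixed STFT with a trainable invertible DNN.

```python
import torch


def mask_based_enhancement(x, mask_net, n_fft=512, hop=128):
    """Standard STFT-masking pipeline: estimate a T-F mask from the noisy
    magnitude spectrogram, apply it, and invert. `mask_net` is a
    placeholder for any DNN mapping |X| to a mask in [0, 1]."""
    window = torch.hann_window(n_fft)
    X = torch.stft(x, n_fft, hop_length=hop, window=window,
                   return_complex=True)
    mask = mask_net(X.abs())   # estimated T-F mask, e.g. values in [0, 1]
    Y = mask * X               # masked (enhanced) complex spectrogram
    return torch.istft(Y, n_fft, hop_length=hop, window=window,
                       length=x.shape[-1])
```

Because the analysis/synthesis pair here is the fixed STFT/inverse STFT, the enhancement quality is tied to that transform; the surveyed paper's contribution is to make this pair trainable while keeping it exactly invertible.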