Convolutional Neural Networks to Enhance Coded Speech

Zhao, Ziyue; Liu, Huijun; Fingscheidt, Tim

doi:10.1109/taslp.2018.2887337

Cited by 63 publications

(43 citation statements)

References 45 publications


“…• The paper shows that a mask based post-filter in the spectral domain performs better than cepstral-domain post-filter (Cepstrum-CNN) as proposed in [12,13].…”

Section: Key Contribution Of This Paper (mentioning)

confidence: 95%

Enhancement of Coded Speech Using a Mask-Based Post-Filter

Korse

Gupta

Fuchs

2020

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)


The quality of speech codecs deteriorates at low bitrates due to high quantization noise. A post-filter is generally employed to enhance the quality of the coded speech. In this paper, a data-driven post-filter relying on masking in the time-frequency domain is proposed. A fully connected neural network (FCNN), a convolutional encoder-decoder (CED) network and a long short-term memory (LSTM) network are implemented to estimate a real-valued mask per time-frequency bin. The proposed models were tested on the five lowest operating modes (6.65 kbps to 15.85 kbps) of the Adaptive Multi-Rate Wideband codec (AMR-WB). Both objective and subjective evaluations confirm the enhancement of the coded speech and also show the superiority of the mask-based neural network system over a conventional heuristic post-filter such as the one used in the ITU-T G.718 standard.
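The masking step described in this abstract can be illustrated with a minimal sketch. It assumes the coded speech is already available as a complex spectrogram (a list of frames, each a list of frequency bins) and that the mask is a real-valued gain per bin. The function names `apply_mask` and `oracle_mask`, the clipping value `max_gain`, and the magnitude-ratio target are illustrative assumptions, not details taken from the paper; in the paper the mask is estimated by a neural network.

```python
from typing import List

Spectrogram = List[List[complex]]  # indexed as [frame][frequency bin]
Mask = List[List[float]]           # one real-valued gain per bin


def apply_mask(coded: Spectrogram, mask: Mask) -> Spectrogram:
    """Scale every time-frequency bin of the coded spectrum by a real gain.

    A real-valued gain changes only the magnitude of each bin; the phase of
    the coded signal is kept as it is.
    """
    if len(coded) != len(mask):
        raise ValueError("mask and spectrogram must have the same number of frames")
    enhanced: Spectrogram = []
    for frame, gains in zip(coded, mask):
        if len(frame) != len(gains):
            raise ValueError("mask and spectrogram must have the same number of bins")
        enhanced.append([g * x for g, x in zip(gains, frame)])
    return enhanced


def oracle_mask(clean: Spectrogram, coded: Spectrogram,
                max_gain: float = 2.0, eps: float = 1e-12) -> Mask:
    """Hypothetical training target: clipped ratio |clean| / |coded| per bin.

    Assumption for illustration only; the source does not specify the target.
    """
    return [
        [min(abs(s) / (abs(x) + eps), max_gain) for s, x in zip(clean_frame, coded_frame)]
        for clean_frame, coded_frame in zip(clean, coded)
    ]
```

In a real system, a trained network would predict the mask from features of the coded signal, and `apply_mask` would be followed by an inverse transform back to the time domain.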


“…With the advent of deep learning, an increasing number of studies using deep neural networks (DNNs) for speech enhancement have shown that these models are able to significantly outperform classical and other machine learning-based methods in terms of speech quality and intelligibility [12][13][14][15][16][17][18][19][20][21]. This is especially true for non-stationary noise conditions, where deep learning-based methods have the advantage of making no assumptions on the stationarity of noise or the underlying distributions of speech and noise.…”

Section: Introduction (mentioning)

confidence: 99%

“…Park et al [27] demonstrate the effectivity of different variations of CEDs and Takahashi et al [20] introduce densely connected convolutional layers and multi-band processing into the architecture. A CED network has also been used by Zhao et al to enhance encoded and subsequently decoded speech in a postprocessing step, showing remarkable generalization capabilities even to unseen codecs [18].…”

Section: Introduction (mentioning)

confidence: 99%

“…In combination with deep learning models, the multi-stage paradigm has been applied to music source separation using feedforward DNNs for the separation task as well as the subsequent task of enhancing the separated signals [37]. A further possibility is proposed in [38], where denoising and dereverberation are addressed in subsequent stages using separately trained feedforward DNNs and joint fine-tuning of the two-stage model is carried out in a second step.…”

Section: Introduction (mentioning)

confidence: 99%

“…We propose to address this problem by first performing noise suppression and subsequently restoring natural sounding speech. Different to [37] and [38], we rely on specifically chosen DNN topologies with beneficial properties for each of the two tasks. An LSTM-based model with its ability to use long-term temporal context to distinguish between noise and speech is used for noise suppression.…”

Section: Introduction (mentioning)

confidence: 99%


Speech enhancement by LSTM-based noise suppression followed by CNN-based speech restoration

Strake

Defraene

Fluyt

et al. 2020

EURASIP J. Adv. Signal Process.

Self Cite


Single-channel speech enhancement in highly non-stationary noise conditions is a very challenging task, especially when interfering speech is included in the noise. Deep learning-based approaches have notably improved the performance of speech enhancement algorithms under such conditions, but still introduce speech distortions when strong noise suppression is required. We propose to address this problem with a two-stage approach, first performing noise suppression and subsequently restoring natural sounding speech, using specifically chosen neural network topologies and loss functions for each task. A mask-based long short-term memory (LSTM) network is employed for noise suppression, and speech restoration is performed via spectral mapping with a convolutional encoder-decoder (CED) network. The proposed method improves speech quality (PESQ) over state-of-the-art single-stage methods by about 0.1 points for unseen highly non-stationary noise types including interfering speech. Furthermore, it is able to increase intelligibility in low-SNR conditions and consistently outperforms all reference methods.
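The two-stage structure described in this abstract can be sketched as a simple composition of two stages. The sketch below works on magnitude spectra only and uses trivial stand-ins for both networks: `toy_mask_estimator` and `toy_restorer` are hypothetical placeholders, not the LSTM or CED models from the paper, and the threshold and floor values are arbitrary assumptions.

```python
from typing import Callable, List

Magnitudes = List[List[float]]  # indexed as [frame][frequency bin]


def two_stage_enhance(
    noisy: Magnitudes,
    mask_estimator: Callable[[Magnitudes], Magnitudes],
    restorer: Callable[[Magnitudes], Magnitudes],
) -> Magnitudes:
    """Stage 1: estimate a mask and apply it. Stage 2: map the result to a restored spectrum."""
    masks = mask_estimator(noisy)
    suppressed = [
        [m * x for m, x in zip(mask_frame, noisy_frame)]
        for mask_frame, noisy_frame in zip(masks, noisy)
    ]
    return restorer(suppressed)


def toy_mask_estimator(noisy: Magnitudes, threshold: float = 0.5) -> Magnitudes:
    """Placeholder for the mask-based stage: attenuate bins below a threshold."""
    return [[1.0 if x >= threshold else 0.1 for x in frame] for frame in noisy]


def toy_restorer(suppressed: Magnitudes, floor: float = 0.01) -> Magnitudes:
    """Placeholder for the spectral-mapping stage: apply a small magnitude floor."""
    return [[max(x, floor) for x in frame] for frame in suppressed]
```

Passing the stages as callables keeps the composition separate from the models, so each stage could be trained or replaced independently, which is the design point the abstract makes about choosing a topology per task.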


Speech Enhancement Based on Deep AutoEncoder for Remote Arabic Speech Recognition

Dendani

Bahi

Sari

2020

Lecture Notes in Computer Science


Remote applications that deal with speech need the speech signal to be compressed. First, speech coding transforms the continuous waveform into a numerical form. Then, the digitized signal is compressed with or without loss of information. This transformation alters the original waveform and degrades the performance of subsequent speech recognition. Transmission is a further source of speech degradation. To restore the original "clean" speech, speech enhancement (SE) is widely used, and deep learning algorithms are currently the state of the art. In this paper, the target application is a remote Arabic speech recognition system, and the aim of using SE is to improve the accuracy of the speech recognizer. For that purpose, a Deep Auto Encoder (DAE) is used. The effect of the DAE-based SE is studied through different configurations, and performance is evaluated in terms of accuracy. The results showed an accuracy improvement of about 3.17 with the enhanced speech compared with the accuracy before SE.

