ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9053951
Full-Reference Speech Quality Estimation with Attentional Siamese Neural Networks

Abstract: In this paper, we present a full-reference speech quality prediction model with a deep learning approach. The model determines a feature representation of the reference and the degraded signal through a siamese recurrent convolutional network that shares the weights for both signals as input. The resulting features are then used to align the signals with an attention mechanism and are finally combined to estimate the overall speech quality. The proposed network architecture represents a simple solution for the…
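The attention-based alignment step described in the abstract can be sketched as follows. This is an illustrative NumPy sketch, not the authors' implementation: the function names, feature dimensions, and the use of plain dot-product attention are assumptions. Each degraded frame attends over all reference frames, producing a reference sequence re-aligned to the degraded signal's time axis.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_align(ref, deg):
    """Align reference frames to each degraded frame via dot-product attention.

    ref: (T_ref, D) reference feature sequence
    deg: (T_deg, D) degraded feature sequence
    Returns (T_deg, D): reference features weighted to match each degraded frame.
    """
    # Similarity of every degraded frame to every reference frame
    scores = deg @ ref.T / np.sqrt(ref.shape[1])   # (T_deg, T_ref)
    weights = softmax(scores, axis=1)              # rows sum to 1
    return weights @ ref                           # attention-weighted reference

rng = np.random.default_rng(0)
ref = rng.standard_normal((50, 16))
deg = ref[5:45]  # degraded signal modeled as a time-shifted excerpt
aligned = attention_align(ref, deg)
```

Because the aligned output has the degraded signal's length, the two feature sequences can be combined frame-by-frame downstream, which is what makes the subsequent quality estimation straightforward.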

Cited by 12 publications (7 citation statements)
References 14 publications (14 reference statements)
“…In a next step, the pretraining database could be improved further with an even wider variety of different speakers and conditions that are more similar to TTS distortions. Also, the full-reference speech quality model presented in [19], which automatically aligns the reference to the degraded signal, could be used to estimate the similarity between original and synthesized/voice-conversed speakers.…”
Section: Discussion
confidence: 99%
“…We use the pretraining database from [19] to first build a speech quality prediction network that is trained on speech communication network degradation. We then use the speech quality prediction domain knowledge of the neural network (github.com/gabrielmittag/NISQA) to improve the reliability of synthesized speech naturalness prediction through transfer learning.…”
Section: Speech Quality Pretraining
confidence: 99%
“…CNN-LSTM is a neural network structure that combines a CNN and an LSTM; it has recently been used for speech quality assessment [2,7,25]. In this structure, CNNs extract deep features of the speech, and the CNN feature vectors are then used as input to an LSTM network that models time dependencies, so CNN-LSTM combines the advantages of both CNN and LSTM.…”
Section: CNN-LSTM
confidence: 99%
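The CNN-then-LSTM pipeline quoted above can be sketched in miniature. This is a hedged NumPy toy, not any paper's architecture: the kernel shapes, a plain tanh recurrence standing in for a real LSTM cell, and all dimensions here are invented for illustration. A convolution over spectrogram frames produces a feature sequence, which a recurrent layer then summarizes over time.

```python
import numpy as np

rng = np.random.default_rng(1)

def conv1d_features(x, kernels):
    """1-D convolution over time. x: (T, D) frames; kernels: (K, w, D) -> (T-w+1, K)."""
    K, w, D = kernels.shape
    T = x.shape[0]
    out = np.empty((T - w + 1, K))
    for t in range(T - w + 1):
        patch = x[t:t + w]  # (w, D) window of frames
        # Correlate each kernel with the window
        out[t] = np.tensordot(kernels, patch, axes=([1, 2], [0, 1]))
    return np.maximum(out, 0.0)  # ReLU

def simple_rnn(feats, Wx, Wh):
    """Toy tanh recurrence over the CNN feature sequence; returns last hidden state.

    A real CNN-LSTM would use an LSTM cell here; the time-dependency modeling
    role is the same.
    """
    h = np.zeros(Wh.shape[0])
    for f in feats:
        h = np.tanh(Wx @ f + Wh @ h)
    return h

T, D, K, w, H = 100, 32, 8, 5, 16   # frames, mel bands, kernels, width, hidden size
x = rng.standard_normal((T, D))     # stand-in mel-spectrogram
kernels = rng.standard_normal((K, w, D)) * 0.1
Wx = rng.standard_normal((H, K)) * 0.1
Wh = rng.standard_normal((H, H)) * 0.1

feats = conv1d_features(x, kernels)  # (96, 8) deep features
h_last = simple_rnn(feats, Wx, Wh)   # (16,) time-summarized representation
```

The final hidden state (or the full hidden sequence) would then feed a small regression head that outputs the quality score.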
“…The speech quality prediction model used in this paper is a narrowband (up to 4 kHz) version of the CNN-LSTM neural network presented in [11,25]. Instead of using mel-spectrograms with a maximum frequency of 16 kHz and 48 mel bands, the mel-spectrogram inputs to the proposed PSTN model have a maximum frequency of 4 kHz and 32 mel bands.…”
Section: Model Description
confidence: 99%
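The narrowband configuration quoted above (32 mel bands up to 4 kHz) can be made concrete by computing the mel filterbank band edges. This sketch uses the standard HTK-style mel formula; the function names are illustrative, and real toolkits (e.g. librosa) handle the filterbank construction internally.

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style mel scale
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    # Inverse of hz_to_mel
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_band_edges(n_mels, f_min, f_max):
    """Edge frequencies (Hz) of a triangular mel filterbank.

    n_mels triangular filters need n_mels + 2 edge points,
    spaced uniformly on the mel scale between f_min and f_max.
    """
    mels = np.linspace(hz_to_mel(f_min), hz_to_mel(f_max), n_mels + 2)
    return mel_to_hz(mels)

# Narrowband configuration from the quote: 32 bands, 0 Hz to 4 kHz
edges_nb = mel_band_edges(32, 0.0, 4000.0)
```

Halving the maximum frequency and reducing the band count simply shrinks the input feature dimension; the rest of the CNN-LSTM architecture can stay unchanged, which is what makes the narrowband adaptation cheap.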