Interspeech 2019
DOI: 10.21437/interspeech.2019-1354
Two-Dimensional Convolutional Recurrent Neural Networks for Speech Activity Detection

Abstract: Speech Activity Detection (SAD) plays an important role in mobile communications and automatic speech recognition (ASR). Developing efficient SAD systems for real-world applications is a challenging task due to the presence of noise. We propose a new approach to SAD in which we treat it as a two-dimensional multi-label image classification problem. To classify the audio segments, we compute their Short-time Fourier Transform spectrograms and classify them with a Convolutional Recurrent Neural Network (CRNN), tradit…
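The feature extraction the abstract describes — turning an audio segment into an STFT spectrogram "image" for a CRNN — can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the 16 kHz sample rate, 25 ms window, 10 ms hop, and synthetic input signal are all assumptions.

```python
import numpy as np

# Synthetic 1-second audio segment at an assumed 16 kHz sample rate
# (a stand-in for a real recording).
sr = 16000
t = np.arange(sr) / sr
audio = np.sin(2 * np.pi * 440 * t) + 0.1 * np.random.randn(sr)

# Short-time Fourier Transform: 25 ms Hann windows with a 10 ms hop,
# a common choice for speech features (assumed here, not from the paper).
win, hop = 400, 160
window = np.hanning(win)
n_frames = 1 + (len(audio) - win) // hop
frames = np.stack(
    [audio[i * hop : i * hop + win] * window for i in range(n_frames)]
)

# Log-magnitude spectrogram: the 2-D time-frequency "image"
# that a CRNN classifier would take as input.
spec = np.log1p(np.abs(np.fft.rfft(frames, axis=1)))
print(spec.shape)  # (time frames, frequency bins) = (98, 201)
```

Each row of `spec` is one time frame and each column one frequency bin, so the segment really is handled as a two-dimensional image, matching the framing in the abstract.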

Cited by 22 publications (15 citation statements)
References 29 publications
“…The buffer window lengths for speech and non-speech segments (Lsp and Lnsp in Eqs. 3–5) are set to 30 and 60 segments, corresponding to 3 and 6 sec, respectively. The adaptation weights α and β (Eqs.…”
Section: Results
confidence: 99%
“…The adaptation weights α and β (Eqs. 3–5) are set to 0.4 and 0.1, respectively. The decision threshold for the non-adaptive system (θM) is set to 0.25.…”
Section: Results
confidence: 99%
“…A number of examples of the use of CRNN models in audio processing can be found in the literature [15,16]. CRNN models have also been applied to the SAD task with relevant results [17].…”
Section: Introduction
confidence: 99%
“…However, direct use of these DL systems for speech/non-speech classification often cannot fully mine speech information under noisy conditions, which reduces VAD robustness. The main reason is that the network encoders are not robust: they are designed to use energy levels, zero-crossings, log-spectrograms, Mel-frequency cepstral coefficients (MFCCs), and raw waveforms [6,10,13].…”
Section: Introduction
confidence: 99%