Interspeech 2016
DOI: 10.21437/interspeech.2016-268

Feature Learning with Raw-Waveform CLDNNs for Voice Activity Detection

Abstract: Voice Activity Detection (VAD) is an important preprocessing step in any state-of-the-art speech recognition system. Choosing the right set of features and model architecture can be challenging and is an active area of research. In this paper we propose a novel approach to VAD to tackle both feature and model selection jointly. The proposed method is based on a CLDNN (Convolutional, Long Short-Term Memory, Deep Neural Networks) architecture fed directly with the raw waveform. We show that using the raw waveform…
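To make the layer ordering the abstract describes concrete (a time convolution over raw samples acting as a learned filterbank, an LSTM for temporal context, and a DNN output layer), here is a minimal PyTorch sketch of a raw-waveform CLDNN for frame-level VAD. The framework, layer sizes, kernel width, and frame length are all illustrative assumptions, not values from the paper.

```python
# A minimal sketch of a raw-waveform CLDNN (convolution -> LSTM -> DNN)
# for frame-level voice activity detection. All hyperparameters below
# are illustrative, not the paper's settings.
import torch
import torch.nn as nn

class RawWaveformCLDNN(nn.Module):
    def __init__(self, frame_len=560, num_filters=40, lstm_units=64):
        super().__init__()
        # Time-convolution over raw samples acts as a learned filterbank,
        # replacing hand-crafted features such as log-mel.
        self.conv = nn.Conv1d(1, num_filters, kernel_size=25)
        self.pool = nn.AdaptiveMaxPool1d(1)  # pool over time within a frame
        self.lstm = nn.LSTM(num_filters, lstm_units, batch_first=True)
        self.classifier = nn.Linear(lstm_units, 2)  # speech / non-speech

    def forward(self, frames):
        # frames: (batch, num_frames, frame_len) raw waveform samples
        b, t, n = frames.shape
        x = self.conv(frames.reshape(b * t, 1, n))    # (b*t, filters, time)
        x = self.pool(x).squeeze(-1).reshape(b, t, -1)
        x, _ = self.lstm(x)                           # temporal modeling
        return self.classifier(x)                     # per-frame logits

logits = RawWaveformCLDNN()(torch.randn(2, 100, 560))
print(logits.shape)  # torch.Size([2, 100, 2])
```

The key design point, as the abstract frames it, is that feature extraction and classification are trained jointly: the convolutional front end learns its filters from the waveform rather than being fixed in advance.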

Cited by 81 publications (54 citation statements). References 13 publications.
“…The architecture of the ASR system is the same as described in Section 2.3, except that the speaker roles are removed from the transcript. Based on our past experience with conventional SD systems, we built a strong baseline system consisting of the following five stages: (a) Speech detection and segmentation: This stage consists of an LSTM-based speech detector whose threshold is kept low to minimize deletion of speech segments [29]. (b) Speaker embedding: The speaker embeddings are computed using a sliding window of 1 s with a stride of 100 ms.…”
Section: Baseline (mentioning)
confidence: 99%
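The second stage in this excerpt computes speaker embeddings over a 1 s sliding window with a 100 ms stride. Below is a minimal sketch of that windowing pass, assuming a 16 kHz sample rate; `embed` is a placeholder standing in for whatever speaker-embedding model the cited system uses, not an API from that work.

```python
# Hedged sketch: slide a 1 s window with 100 ms hop over a waveform
# and collect one embedding per window. Sample rate and the `embed`
# callable are assumptions for illustration.
import numpy as np

def sliding_embeddings(waveform, embed, sr=16000, win_s=1.0, hop_s=0.1):
    win, hop = int(win_s * sr), int(hop_s * sr)
    embeddings = []
    for start in range(0, len(waveform) - win + 1, hop):
        embeddings.append(embed(waveform[start:start + win]))
    return np.stack(embeddings)  # (num_windows, embedding_dim)

# Toy usage: a trivial two-number "embedding" on 3 s of noise.
emb = sliding_embeddings(np.random.randn(48000),
                         embed=lambda w: np.array([w.mean(), w.std()]))
print(emb.shape)  # (21, 2)
```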
“…This is motivated by recent successes in directly modeling the raw speech signal for various tasks, such as speech recognition [11,12,13], emotion recognition [14], voice activity detection [15], presentation attack detection [16], and speaker recognition [17]. In particular, we build upon recent works [12,16,17] to investigate the following:…”
Section: Introduction (mentioning)
confidence: 99%
“…Different approaches have been proposed to feed the networks directly with the waveform of the audio signals, instead of extracting features from the data as a first step, in order to develop end-to-end systems. The CLDNN [3,19] (an acronym for convolutional LSTM DNN) architecture is specifically designed for such a task, which is also referred to as feature learning. Related research in the field includes models like SincNet [20] or Wavenet [21], the latter being mainly proposed as a generative model for audio signals.…”
Section: Why DNNs in Speech and Music Detection? (mentioning)
confidence: 99%
“…Works on the detection of specific events can also be found, such as voice activity detection for recognizing the presence of human speech [1][2][3], or music activity detection, the analogous detection problem oriented to musical content [4,5]. In both cases, the complexity of the problem does not come from the number of different event classes to be detected, but from the high variability of the content found in speech and music signals.…”
Section: Introduction (mentioning)
confidence: 99%