Interspeech 2016
DOI: 10.21437/interspeech.2016-268

Feature Learning with Raw-Waveform CLDNNs for Voice Activity Detection

Abstract: Voice Activity Detection (VAD) is an important preprocessing step in any state-of-the-art speech recognition system. Choosing the right set of features and model architecture can be challenging and is an active area of research. In this paper we propose a novel approach to VAD to tackle both feature and model selection jointly. The proposed method is based on a CLDNN (Convolutional, Long Short-Term Memory, Deep Neural Networks) architecture fed directly with the raw waveform. We show that using the raw waveform…
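To make the layer ordering the abstract describes concrete (a time convolution over raw samples acting as a learned filterbank, an LSTM for temporal context, and a DNN output layer), here is a minimal PyTorch sketch of a raw-waveform CLDNN for frame-level VAD. The framework, layer sizes, kernel width, and frame length are all illustrative assumptions, not values from the paper.

```python
# A minimal sketch of a raw-waveform CLDNN (convolution -> LSTM -> DNN)
# for frame-level voice activity detection. All hyperparameters below
# are illustrative, not the paper's settings.
import torch
import torch.nn as nn

class RawWaveformCLDNN(nn.Module):
    def __init__(self, frame_len=560, num_filters=40, lstm_units=64):
        super().__init__()
        # Time-convolution over raw samples acts as a learned filterbank,
        # replacing hand-crafted features such as log-mel.
        self.conv = nn.Conv1d(1, num_filters, kernel_size=25)
        self.pool = nn.AdaptiveMaxPool1d(1)  # pool over time within a frame
        self.lstm = nn.LSTM(num_filters, lstm_units, batch_first=True)
        self.classifier = nn.Linear(lstm_units, 2)  # speech / non-speech

    def forward(self, frames):
        # frames: (batch, num_frames, frame_len) raw waveform samples
        b, t, n = frames.shape
        x = self.conv(frames.reshape(b * t, 1, n))    # (b*t, filters, time)
        x = self.pool(x).squeeze(-1).reshape(b, t, -1)
        x, _ = self.lstm(x)                           # temporal modeling
        return self.classifier(x)                     # per-frame logits

logits = RawWaveformCLDNN()(torch.randn(2, 100, 560))
print(logits.shape)  # torch.Size([2, 100, 2])
```

The key design point, as the abstract frames it, is that feature extraction and classification are trained jointly: the convolutional front end learns its filters from the waveform rather than being fixed in advance.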

Cited by 81 publications (54 citation statements). References 13 publications.
“…The architecture of the ASR system is the same as described in Section 2.3, except that the speaker roles are removed from the transcript. Based on our past experience with conventional SD systems, we built a strong baseline system consisting of the following five stages: (a) Speech detection and segmentation: This stage consists of an LSTM-based speech detector whose threshold is kept low to minimize deletion of speech segments [29]. (b) Speaker embedding: The speaker embeddings are computed using a sliding window of 1 s with a stride of 100 ms.…”
Section: Baseline (mentioning)
confidence: 99%
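The second stage in this excerpt computes speaker embeddings over a 1 s sliding window with a 100 ms stride. Below is a minimal sketch of that windowing pass, assuming a 16 kHz sample rate; `embed` is a placeholder standing in for whatever speaker-embedding model the cited system uses, not an API from that work.

```python
# Hedged sketch: slide a 1 s window with 100 ms hop over a waveform
# and collect one embedding per window. Sample rate and the `embed`
# callable are assumptions for illustration.
import numpy as np

def sliding_embeddings(waveform, embed, sr=16000, win_s=1.0, hop_s=0.1):
    win, hop = int(win_s * sr), int(hop_s * sr)
    embeddings = []
    for start in range(0, len(waveform) - win + 1, hop):
        embeddings.append(embed(waveform[start:start + win]))
    return np.stack(embeddings)  # (num_windows, embedding_dim)

# Toy usage: a trivial two-number "embedding" on 3 s of noise.
emb = sliding_embeddings(np.random.randn(48000),
                         embed=lambda w: np.array([w.mean(), w.std()]))
print(emb.shape)  # (21, 2)
```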
“…This is motivated by recent successes in directly modeling the raw speech signal for various tasks, such as speech recognition [11,12,13], emotion recognition [14], voice activity detection [15], presentation attack detection [16], and speaker recognition [17]. In particular, we build upon recent works [12,16,17] to investigate the following:…”
Section: Introduction (mentioning)
confidence: 99%
“…Different approaches have been proposed to feed the networks directly with the waveform of the audio signals, instead of extracting features from the data as a first step, in order to develop end-to-end systems. The CLDNN [3,19] (an acronym for convolutional LSTM DNN) architecture is specifically designed for such a task, which is also referred to as feature learning. Related research in the field includes models like SincNet [20] or Wavenet [21], the latter being mainly proposed as a generative model for audio signals.…”
Section: Why DNNs in Speech and Music Detection? (mentioning)
confidence: 99%
“…Works on the detection of specific events can also be found, such as voice activity detection for recognizing the presence of human speech [1][2][3], or music activity detection, the analogous detection problem oriented to musical content [4,5]. In both cases, the complexity of the problem does not come from the number of different event classes to be detected, but from the high variability of the content found in speech and music signals.…”
Section: Introduction (mentioning)
confidence: 99%