ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2019
DOI: 10.1109/icassp.2019.8682532

Robust Speech Activity Detection in Movie Audio: Data Resources and Experimental Evaluation


Cited by 11 publications (28 citation statements)
References 17 publications
“…We performed both architecture search as well as hyperparameter tuning for determining the best-performing model architecture and the number of convolutional blocks and hidden layer dimensions for recurrent and fully connected layers therein. The CNN-based architectures include standard CNN, CNN-GAP, CLDNN, and CNN-TD models [27]. The difference in these architectures is in the handling of the final output of the convolutional layers.…”
Section: Neural Network Architectures
confidence: 99%
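The excerpt above notes that the CNN variants differ only in how the final convolutional output is handled. As a minimal illustration (with made-up shapes, not those of the cited models), the main options can be sketched in NumPy:

```python
import numpy as np

# Toy final convolutional output with shape (time, frequency, channels).
# These dimensions are illustrative assumptions only.
conv_out = np.random.rand(20, 8, 32)

# Standard CNN: flatten everything into one long feature vector
# for a fully connected classifier.
flat = conv_out.reshape(-1)          # shape: (20 * 8 * 32,)

# CNN-GAP: global average pooling over time and frequency,
# leaving one value per channel.
gap = conv_out.mean(axis=(0, 1))     # shape: (32,)

# CNN-TD (time-distributed): pool only over frequency, keeping a
# per-time-step feature vector for a downstream temporal classifier.
td = conv_out.mean(axis=1)           # shape: (20, 32)

print(flat.shape, gap.shape, td.shape)
```

The pooling choice trades off parameter count (flattening feeds a large vector into the dense layers) against how much temporal structure survives into the classifier.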
“…The 3000 x 64-dimensional features are then reduced to binary class posteriors for foreground classification. Embeddings from a speech activity detection model trained on movie data [27] are used for transfer learning in the foreground detection task. Convolutional neural network models were trained on 0.64 s audio segments for a two-class speech/non-speech classification problem.…”
Section: Features For Foreground Detection
confidence: 99%
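The excerpt mentions training on 0.64 s audio segments. A minimal sketch of that segmentation step, assuming a 16 kHz sample rate (a common choice, not stated in the excerpt) and non-overlapping windows:

```python
import numpy as np

def segment_audio(samples, sample_rate=16000, seg_dur=0.64):
    """Split a 1-D signal into non-overlapping fixed-duration segments,
    dropping any incomplete trailing segment."""
    seg_len = int(seg_dur * sample_rate)       # 10240 samples at 16 kHz
    n_segs = len(samples) // seg_len
    return samples[: n_segs * seg_len].reshape(n_segs, seg_len)

# Example: 3 s of silence at 16 kHz yields 4 full 0.64 s segments.
audio = np.zeros(3 * 16000)
segs = segment_audio(audio)
print(segs.shape)  # (4, 10240)
```

Each row of the result would then be fed to the classifier as one speech/non-speech example.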
“…2 but with only two view-branches. [23] and speech activity detection [24]. The CNN of the view-branches in our model is a smaller version of that in [24], and is shown in Fig.…”
Section: I-vector
confidence: 99%
“…Apart from feature-based [1,2,3] and statistical modeling approaches [4,5], recent research effort has been devoted to finding efficient deep-learning-based VAD model architectures. Notable examples include Recurrent Neural Networks (RNN) [6,7,8], Convolutional Neural Networks (CNN) [9,10,11,12], and Convolutional Long Short-Term Memory (LSTM) Deep Neural Networks (CLDNN) [13], which conduct frequency modeling with CNN and temporal modeling with LSTM. LSTM is a popular choice for sequential modeling of VAD tasks [13,6].…”
Section: Introduction
confidence: 99%
“…They also demonstrated that CNNs were useful acoustic models in novel channel scenarios and able to adapt well with limited amounts of data. Hebbar et al. [11] compared different LSTM and CNN models on more challenging movie data containing post-production effects and atypical speech such as electronically modified speech samples. They proposed a Convolutional Neural Network-Time Distributed (CNN-TD) model (with 740K parameters) that outperformed existing models including Bi-LSTM (with 300K parameters), CLDNN (with 1M parameters) and ResNet 960 (with 30M parameters) [9] on the benchmark evaluation dataset AVA-Speech.…”
Section: Introduction
confidence: 99%