“…Also, a vast majority of our sub-systems use an energy-based voice activity detector (VAD) in view of its simplicity and effectiveness. Other VAD options that have been adopted are (i) VQ-VAD [21] in Sys1 and Sys14, (ii) speech/non-speech probabilities inferred from the DNN senone posteriors in Sys9, and (iii) two-channel VAD [22]. […] [14,15,27], there are a handful of our sub-systems (six out of seventeen in Table 3) that have successfully incorporated deep learning in one form or another: (i) deep bottleneck features (DBF) in Sys9, (ii) stacked bottleneck features in Sys11, (iii) DNN posteriors in Sys2, Sys9, Sys10, and Sys16, (iv) splice time-delay DNN (TDNN) [16] in Sys2, and (v) a denoising autoencoder in Sys14. For the bottleneck features in Sys9, we used a DNN with seven hidden layers, each having 1024 hidden units except for the third layer, which has only 80 units.…”
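The energy-based VAD mentioned above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the frame length, hop size, and the threshold relative to peak log-energy are all assumed values chosen for the example.

```python
import numpy as np

def energy_vad(signal, sample_rate, frame_ms=25, hop_ms=10, threshold_db=-35.0):
    """Flag frames whose log-energy exceeds a threshold relative to the peak
    frame energy. frame_ms/hop_ms/threshold_db are illustrative defaults."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    energies = np.empty(n_frames)
    for i in range(n_frames):
        frame = signal[i * hop_len : i * hop_len + frame_len]
        # Small floor avoids log(0) on all-zero frames.
        energies[i] = np.sum(frame.astype(np.float64) ** 2) + 1e-12
    log_e = 10.0 * np.log10(energies)
    # A frame is "speech" if it is within threshold_db of the loudest frame.
    return log_e > (log_e.max() + threshold_db)
```

Its appeal for the sub-systems above is exactly what the text states: no model to train, one pass over the frame energies, and a single relative threshold to tune.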
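The bottleneck topology described for Sys9 (seven hidden layers of 1024 units, with the third narrowed to 80 units) can be sketched as a forward pass in which the DBF is read off at the narrow layer. This is a structural sketch only: the input dimension, output senone count, sigmoid nonlinearity, and random weights are all assumptions, not details given in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, n_senones = 440, 3000  # hypothetical dimensions, for illustration only

# Seven hidden layers of 1024 units, except the third (bottleneck) with 80 units.
hidden = [1024, 1024, 80, 1024, 1024, 1024, 1024]
dims = [input_dim] + hidden + [n_senones]
weights = [rng.standard_normal((d_in, d_out)) * 0.01
           for d_in, d_out in zip(dims[:-1], dims[1:])]

def forward(x, bottleneck_layer=3):
    """Run the stack; return senone posteriors and the 80-dim bottleneck feature."""
    h, bottleneck = x, None
    for i, w in enumerate(weights, start=1):
        h = h @ w
        if i < len(weights):                    # hidden layers: sigmoid (assumed)
            h = 1.0 / (1.0 + np.exp(-h))
        if i == bottleneck_layer:               # tap the narrow third layer
            bottleneck = h
    e = np.exp(h - h.max())                     # softmax over senones
    return e / e.sum(), bottleneck

probs, bnf = forward(rng.standard_normal(input_dim))
```

Once trained for senone classification, only the tap at the third layer matters for speaker recognition: the 80-dimensional activations replace (or augment) acoustic features in the downstream system.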