Features for voice activity detection: a comparative analysis

Graf, Simon; Herbig, Tobias; Buck, Markus; Schmidt, Gerhard

doi:10.1186/s13634-015-0277-z

Cited by 52 publications

(28 citation statements)

References 49 publications

“…A batch of training data comprises 128 sequences, with each sequence consisting of 20 feature vectors. The feature vector x_t is computed from the observed signals every millisecond according to (6). The dataset consists of recordings of a desired target speaker, up to 4 simultaneously active interferers, and babble noise in the background.…”

Section: Implementation and Scenarios (mentioning)

confidence: 99%

“…In order to achieve a comparable performance both on small network sizes and small amounts of training data, the selection of feature vectors is indispensable. Classical approaches for Voice Activity Detection (VAD) are typically single-channel methods exploiting distinctive properties of speech signals like stationarity, harmonic structure and spectral envelopes in order to differentiate between speech and background noise [5,6]. These VAD methods, however, cannot be used to differentiate between a target speaker and interfering speech sources as the proposed TAD does.…”

Section: Introduction (mentioning)

confidence: 99%

Efficient target activity detection based on recurrent neural networks

Gerber; Meier; Kellermann

2017

2017 Hands-Free Speech Communications and Microphone Arrays (HSCMA)

This paper addresses the problem of Target Activity Detection (TAD) for binaural listening devices. TAD denotes the problem of robustly detecting the activity of a target speaker in a harsh acoustic environment, which comprises interfering speakers and noise ('cocktail party scenario'). In previous work, it has been shown that employing a Feed-forward Neural Network (FNN) for detecting the target speaker activity is a promising approach to combine the advantage of different TAD features (used as network inputs). In this contribution, we exploit a larger context window for TAD and compare the performance of FNNs and Recurrent Neural Networks (RNNs) with an explicit focus on small network topologies as desirable for embedded acoustic signal processing systems. More specifically, the investigations include a comparison between three different types of RNNs, namely plain RNNs, Long Short-Term Memories, and Gated Recurrent Units. The results indicate that all versions of RNNs outperform FNNs for the task of TAD.

“…Voice Activity Detection (VAD) is widely researched in audio signal processing and used for audio conferencing, speech encoding, speech recognition, and speaker recognition [17,26]. VAD methods detect voice activity (primarily speech) from a noisy audio signal [16,24,29]. Video content-based camera motion analysis methods make use of template matching [1] and optical flow [6].…”

Section: Focused Interaction Dataset (mentioning)

confidence: 99%

Finding Time Together: Detection and Classification of Focused Interaction in Egocentric Video

Bano; Zhang; McKenna

2017

2017 IEEE International Conference on Computer Vision Workshops (ICCVW)

Focused interaction occurs when co-present individuals, having mutual focus of attention, interact by establishing face-to-face engagement and direct conversation. Face-to-face engagement is often not maintained throughout the entirety of a focused interaction. In this paper, we present an online method for automatic classification of unconstrained egocentric (first-person perspective) videos into segments having no focused interaction, focused interaction when the camera wearer is stationary and focused interaction when the camera wearer is moving. We extract features from both audio and video data streams and perform temporal segmentation by using support vector machines with linear and non-linear kernels. We provide empirical evidence that fusion of visual face track scores, camera motion profile and audio voice activity scores is an effective combination for focused interaction classification.

“…It is a fact that SNR can be high at a single frequency point when speech (especially voiced frame) is present, even though the overall SNR of a signal is low (such as 0 dB) [13]. To each frequency point, the entropy of the R continuous frames before the current frame will abruptly become small when speech suddenly appears in the current frame.…”

Section: Proposed Approach (mentioning)

confidence: 99%

A priori SNR estimation and noise estimation for speech enhancement

Yao; Zeng; Zhu

2016

EURASIP J. Adv. Signal Process.

A priori signal-to-noise ratio (SNR) estimation and noise estimation are important for speech enhancement. In this paper, a novel modified decision-directed (DD) a priori SNR estimation approach based on single-frequency entropy, named DDBSE, is proposed. DDBSE replaces the fixed weighting factor in the DD approach with an adaptive one calculated according to change of single-frequency entropy. Simultaneously, a new noise power estimation approach based on unbiased minimum mean square error (MMSE) and voice activity detection (VAD), named UMVAD, is proposed. UMVAD adopts different strategies to estimate noise in order to reduce over-estimation and under-estimation of noise. UMVAD improves the classical statistical model-based VAD by utilizing an adaptive threshold to replace the original fixed one and modifies the unbiased MMSE-based noise estimation approach using an adaptive a priori speech presence probability calculated by entropy instead of the original fixed one. Experimental results show that DDBSE can provide greater noise suppression than DD and UMVAD can improve the accuracy of noise estimation. Compared to existing approaches, speech enhancement based on UMVAD and DDBSE can obtain a better segment SNR score and composite measure covl score, especially in adverse environments such as non-stationary noise and low-SNR.
