2014
DOI: 10.1016/j.csl.2013.07.003

A study of voice activity detection techniques for NIST speaker recognition evaluations

Abstract: Since 2008, interview-style speech has become an important part of the NIST Speaker Recognition Evaluations (SREs). Unlike telephone speech, interview speech has a lower signal-to-noise ratio, which necessitates robust voice activity detectors (VADs). This paper highlights the characteristics of interview speech files in NIST SREs and discusses the difficulties in performing speech/non-speech segmentation in these files. To overcome these difficulties, this paper proposes using speech enhancement techniques as a…

Cited by 107 publications (50 citation statements)
References 36 publications
“…1. A spectral-subtraction-based voice activity detection (VAD) proposed in [24] is applied to detect the sound regions. MFCCs [6] and GFCCs [38][39][40] are extracted from the sound regions only.…”
Section: System Overview (mentioning)
confidence: 99%
“…Also, a vast majority of our sub-systems use energy-based voice activity detector (VAD) in view of its simplicity and effectiveness. Other options for VAD that have been adopted are (i) VQ-VAD [21] in Sys1 and Sys14, (ii) speech/non-speech probabilities inferred from the DNN senone posterior in Sys9, and (iii) two-channel VAD [22] [14,15,27], there are a handful of our sub-systems (six out of seventeen in Table 3) that have successfully incorporated deep learning in one form or another: (i) Deep bottleneck feature (DBF) in Sys9, (ii) Stacked bottleneck feature in Sys11, (iii) DNN posterior in Sys2, Sys9, Sys10, Sys16, (iv) Splice time delay DNN (TDNN) [16] in Sys2, and (v) Denoising autoencoder in Sys14. For the bottleneck features in Sys9 we used a DNN with seven hidden layers each having 1024 hidden units except for the third layer with only 80 units.…”
Section: Train Development and Test Sets (mentioning)
confidence: 99%
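
The energy-based VAD mentioned in the statement above can be illustrated with a minimal sketch. The citing paper does not specify its detector, so everything here is an assumption made for illustration: the function name `energy_vad`, the 400-sample frame and 160-sample hop (25 ms and 10 ms at an assumed 16 kHz sampling rate), and the rule of keeping frames within 30 dB of the loudest frame.

```python
import math


def energy_vad(samples, frame_len=400, hop=160, margin_db=30.0):
    """Label each frame as speech (True) or non-speech (False).

    Hypothetical sketch: a frame counts as speech when its log energy
    lies within `margin_db` of the loudest frame in the recording.
    """
    if len(samples) < frame_len:
        return []
    n_frames = 1 + (len(samples) - frame_len) // hop
    log_energies = []
    for i in range(n_frames):
        frame = samples[i * hop : i * hop + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        # A small floor avoids log10(0) on digitally silent frames.
        log_energies.append(10.0 * math.log10(energy + 1e-12))
    threshold = max(log_energies) - margin_db
    return [e > threshold for e in log_energies]


# Usage: silence, then a 200 Hz tone, then silence (0.1 s each at 16 kHz).
silence = [0.0] * 1600
tone = [math.sin(2 * math.pi * 200 * n / 16000) for n in range(1600)]
decisions = energy_vad(silence + tone + silence)
```

A threshold relative to the loudest frame is only one possible choice. On low-SNR recordings such as the interview speech described in the abstract, a fixed margin of this kind may mislabel noisy frames, which is the kind of difficulty the paper sets out to address.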
“…Voice Activity Detection (VAD) is widely researched in audio signal processing and used for audio conferencing, speech encoding, speech recognition, and speaker recognition [17,26]. VAD methods detect voice activity (primarily speech) from a noisy audio signal [16,24,29]. Video content-based camera motion analysis methods make use of template matching [1] and optical flow [6].…”
Section: Focused Interaction Dataset (mentioning)
confidence: 99%