2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2016
DOI: 10.1109/icassp.2016.7472768

A comparative study of robustness of deep learning approaches for VAD

Abstract: Voice activity detection (VAD) is an important step for real-world automatic speech recognition (ASR) systems. Deep learning approaches, such as DNN, RNN or CNN, have been widely used in model-based VAD. Although they have achieved success in practice, they have been developed separately on different VAD tasks. Whilst VAD performance under noisy conditions, especially with unseen noise or very low SNR, is of great interest, there has been no robustness comparison of different deep learning approaches so far. In this pap…

Cited by 41 publications (13 citation statements)
References 18 publications
“…This makes it hard for an attacker to change an output decision without drastically changing the input (which may significantly hurt the semantic information and thus is infeasible for the attacker). RNN detectors are temporally deeper and can model long-range temporal features, thus, are more robust to adversarial attacks [37].…”
Section: B Adversarial Robustness Evaluation 1) Impact of DNN Architecture
Citation type: mentioning
confidence: 99%
“…This makes it hard for an attacker to change an output decision without drastically changing the input (which may significantly hurt the semantic information and thus is infeasible for the attacker). RNN detectors are temporally deeper and can model long-range temporal features, thus, are more robust to adversarial attacks [33].…”
Section: B Adversarial Robustness Evaluation
Citation type: mentioning
confidence: 99%
“…F(·) ∈ ℝ² denotes the neural network based function that outputs the conditional probability of each state, θ denotes its parameter, and [·]_{s_t} is the s_t-th element of the vector. A long short-term memory (LSTM) network is often used as F(·) to handle long-term dependencies between input features [27]. Cross entropy (CE) is often used as a loss function to estimate the parameter θ that maximizes (1) as…”
Section: Standard Voice Activity Detection
Citation type: mentioning
confidence: 99%
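
The excerpt above stops before its loss equation, so the exact objective of the citing paper is not recoverable here. As a rough illustration only, a generic frame-level cross entropy over two states (non-speech and speech) might look like the sketch below. The function names `softmax` and `frame_cross_entropy` are hypothetical, and a plain softmax over two raw scores stands in for the network F(·); no LSTM is modeled.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities (numerically stable)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def frame_cross_entropy(frame_scores, states):
    """Mean negative log-probability of the labeled state per frame.

    frame_scores: one [non_speech_score, speech_score] pair per frame,
                  an assumed stand-in for the output of F(.) before
                  normalization.
    states:       the labeled state s_t for each frame (0 or 1).
    """
    if not states or len(frame_scores) != len(states):
        raise ValueError("need one label per frame and at least one frame")
    loss = 0.0
    for scores, s_t in zip(frame_scores, states):
        probs = softmax(scores)        # two-element probability vector
        loss -= math.log(probs[s_t])   # pick the s_t-th element
    return loss / len(states)
```

With uninformative scores such as `[[0.0, 0.0]]` every frame contributes log 2 to the loss, and the value falls as the score of the labeled state rises, which is the behavior a cross-entropy objective relies on when fitting θ.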