2023
DOI: 10.3390/electronics12030705

Supervised Contrastive Learning for Voice Activity Detection

Abstract: The noise robustness of voice activity detection (VAD) tasks, which are used to identify the human speech portions of a continuous audio signal, is important for subsequent downstream applications such as keyword spotting and automatic speech recognition. Although various aspects of VAD have been recently studied by researchers, a proper training strategy for VAD has not received sufficient attention. Thus, a training strategy for VAD using supervised contrastive learning is proposed for the first time in this…

Cited by 2 publications (2 citation statements)
References 23 publications
“…In [6], a VAD learning strategy using Supervised Contrastive Learning (Supervised Contrastive Learning for Voice Activity Detection, SCLVAD) was proposed for the first time. The proposed method was used in combination with audio-specific data augmentation methods, which were trained using two common sets of English language speech data: the Google Speech Commands Dataset V2 and audio samples from the site freesound.org, and then evaluated using the third AVA-Speech English dataset.…”
Section: Literature Review and Problem Statement
Citation type: mentioning
confidence: 99%
“…These two advantages of SCL settings provide more efficient feature learning over the SSCL approach. Several studies in the audio domain have effectively applied SCL, for instance, in environmental sound classification [31], voice activity detection [32], accented speech recognition [33], and musical onset detection [34], exhibiting superior performance when compared to models trained using cross-entropy.…”
Section: Related Work
Citation type: mentioning
confidence: 99%
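
The citation statements above contrast supervised contrastive learning (SCL) with cross-entropy training. As a rough illustration only, the following is a minimal sketch of the general supervised contrastive loss (in the form given by Khosla et al., 2020) applied to frame embeddings labelled speech (1) or non-speech (0). It is not the implementation from the cited paper: the function names `l2_normalize` and `supcon_loss`, the temperature value of 0.1, and the toy 2-D embeddings are all assumptions made for this example.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def supcon_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss averaged over anchors that have a positive.

    embeddings: list of equal-length float vectors (hypothetical frame embeddings)
    labels:     list of class labels, e.g. 1 = speech, 0 = non-speech
    temperature: assumed value; a tunable hyperparameter in practice
    """
    z = [l2_normalize(v) for v in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        # Temperature-scaled cosine similarity of anchor i to every other sample.
        sims = {
            a: sum(x * y for x, y in zip(z[i], z[a])) / temperature
            for a in range(n)
            if a != i
        }
        # Positives are the other samples that share the anchor's label.
        positives = [a for a in sims if labels[a] == labels[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        # Numerically stable log-sum-exp over all non-anchor samples.
        m = max(sims.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        # Mean negative log-probability of the positives for this anchor.
        total += -sum(sims[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


# Toy data: two nearby "speech" embeddings and two nearby "non-speech" ones.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss_matched = supcon_loss(toy_embeddings, [1, 1, 0, 0])  # labels agree with clusters
loss_mismatched = supcon_loss(toy_embeddings, [1, 0, 1, 0])  # labels cut across clusters
```

In this toy setup the loss is lower when same-label embeddings already sit close together, which is the property the training objective rewards; a cross-entropy objective, by contrast, scores each frame's predicted class independently rather than comparing embeddings pairwise.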