2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA 2017)
DOI: 10.1109/icsda.2017.8384446
Linear-scale filterbank for deep neural network-based voice activity detection

Cited by 12 publications (6 citation statements) · References 13 publications
“…Despite the success of MFCCs for both speech and non-speech applications, some studies suggest that GTCCs outperform MFCCs in non-speech audio recognition (Valero and Alias, 2012) and in applications featuring diverse acoustic environments (Bonet-Solà and Alsina-Pagès, 2021). LFCCs have also been shown to consistently outperform MFCCs in applications that contain higher-frequency information of interest (Jung et al, 2017; Lei and Lopez-Gonzalo, 2009; Zhou et al, 2011).…”
Section: Methods
confidence: 99%
“…The LLFB features are extracted by a classical pipeline for filterbank-based features. Specifically, a signal goes through a pre-emphasis filter; it is segmented into (overlapping) frames and a Hamming window function is applied to each frame (the frame length is 25 ms and the frame step size is 10 ms); afterwards, we apply a 256-point short-time Fourier transform (STFT) to each frame and calculate the power spectrum; and subsequently we compute and apply a linear-scale filterbank of 80 triangular overlapping windows whose center frequencies are equally spaced on the Hz scale [13]. By taking the logarithm of the power spectrogram and truncating these utterance-level features with the same processing used for the CQT feature, LLFB features with a unified shape of 80 × 400 (where 400 is the number of time frames) can be obtained.…”
Section: Front-end Processing
confidence: 99%
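The pipeline quoted above can be sketched as follows. This is a minimal illustration, not the cited authors' code: the pre-emphasis coefficient (0.97) and the 8 kHz sample rate are assumptions (at 8 kHz a 25 ms frame is 200 samples, which fits inside the stated 256-point FFT), and the function names are hypothetical.

```python
import numpy as np

def linear_filterbank(n_filters=80, n_fft=256, sample_rate=8000):
    """Triangular filters whose center frequencies are equally spaced
    in Hz (linear scale), unlike the mel-warped spacing used for MFCCs."""
    n_bins = n_fft // 2 + 1
    # n_filters + 2 edge points, equally spaced from 0 Hz to Nyquist
    edges_hz = np.linspace(0, sample_rate / 2, n_filters + 2)
    edges_bin = np.floor((n_fft + 1) * edges_hz / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_bins))
    for m in range(1, n_filters + 1):
        left, center, right = edges_bin[m - 1], edges_bin[m], edges_bin[m + 1]
        for k in range(left, center):           # rising slope
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):          # falling slope
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    return fbank

def llfb(signal, sample_rate=8000, n_fft=256, n_filters=80):
    """Log linear-scale filterbank features: pre-emphasis -> 25 ms frames
    with a 10 ms step -> Hamming window -> 256-point power spectrum ->
    linear filterbank -> log."""
    # Pre-emphasis y[n] = x[n] - 0.97 x[n-1] (0.97 is a common default;
    # the quoted passage does not give the coefficient)
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    frame_len = int(0.025 * sample_rate)   # 25 ms
    frame_step = int(0.010 * sample_rate)  # 10 ms
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // frame_step)
    window = np.hamming(frame_len)
    fbank = linear_filterbank(n_filters, n_fft, sample_rate)
    feats = np.zeros((n_frames, n_filters))
    for t in range(n_frames):
        frame = emphasized[t * frame_step : t * frame_step + frame_len] * window
        power = np.abs(np.fft.rfft(frame, n_fft)) ** 2 / n_fft
        feats[t] = np.log(fbank @ power + 1e-10)  # small offset avoids log(0)
    return feats
```

For a 1-second signal at 8 kHz this yields a (98, 80) feature matrix; the cited work then truncates or pads along the time axis to a fixed 80 × 400 shape.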
“…Since the vocal information of infant cry is distinctly characterized in the time-frequency domain, such as melodies and formants [14], [15], we considered a spectrogram-based representation of audio signals as the input of CNNs for feature learning. In this study, the log linear-scale filterbank (LLFB) spectrogram [16] was used to represent audio signals.…”
Section: Introduction
confidence: 99%