Auditory feature representation using convolutional restricted Boltzmann machine and Teager energy operator for speech recognition

Sailor, Hardik B.; Patil, Hemant A.

doi:10.1121/1.4983751

Cited by 11 publications

(8 citation statements)

References 12 publications


“…Our future works includes detailed analysis of natural and spoof speech regarding the nature of subband filters and frequency scale. We would also like use our Unsupervised Deep Auditory Model (UDAM) [30] along with TEO [31] for the SSD task.…”

Section: Discussion (mentioning)

confidence: 99%

Unsupervised Representation Learning Using Convolutional Restricted Boltzmann Machine for Spoof Speech Detection

2017

Self Cite


Speech Synthesis (SS) and Voice Conversion (VC) present a genuine risk of attacks for Automatic Speaker Verification (ASV) technology. In this paper, we use our recently proposed unsupervised filterbank learning technique using Convolutional Restricted Boltzmann Machine (ConvRBM) as a front-end feature representation. ConvRBM is trained on the training subset of the ASVspoof 2015 challenge database. Analyzing the filterbank trained on this dataset shows that ConvRBM learned more low-frequency subband filters than when trained on a natural speech database such as TIMIT. The spoofing detection experiments were performed using Gaussian Mixture Models (GMM) as a back-end classifier. ConvRBM-based cepstral coefficients (ConvRBM-CC) perform better than handcrafted Mel Frequency Cepstral Coefficients (MFCC). On the evaluation set, ConvRBM-CC features give an absolute reduction of 4.76 % in Equal Error Rate (EER) compared to MFCC features. Specifically, ConvRBM-CC features perform significantly better on both known attacks (1.93 %) and unknown attacks (5.87 %) compared to MFCC features.
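The abstract above reports its results as Equal Error Rate (EER), the operating point at which the false acceptance rate and the false rejection rate coincide. The following is a minimal sketch of how EER can be approximated from two lists of detector scores. It is not the authors' evaluation code: the function name `equal_error_rate`, the convention that higher scores mean "more likely genuine", and the toy scores are all assumptions made for illustration.

```python
def equal_error_rate(genuine, spoof):
    """Approximate the EER by sweeping every observed score as a threshold.

    Assumption: a trial is accepted as genuine when its score >= threshold.
    """
    best_gap, eer = None, None
    for t in sorted(set(genuine) | set(spoof)):
        far = sum(s >= t for s in spoof) / len(spoof)      # spoof trials accepted
        frr = sum(g < t for g in genuine) / len(genuine)   # genuine trials rejected
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Hypothetical toy scores, not taken from any dataset.
genuine_scores = [0.9, 0.8, 0.7, 0.6]
spoof_scores = [0.1, 0.2, 0.3, 0.65]
toy_eer = equal_error_rate(genuine_scores, spoof_scores)
```

With these toy scores, one spoof trial outranks one genuine trial, so the two error rates meet at 25 %. An "absolute reduction" in EER, as quoted in the abstract, is the difference between two such percentages.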


“…Compared to our earlier work in [13], [14], we have used noisy leaky rectifier linear units (NLReLU) proposed in [19] to avoid the limitations of ReLU. Annealing dropout is applied in the ConvRBM training with the annealing schedule chosen in [20]. The ConvRBM training is performed using contrastive divergence (CD) [21].…”

Section: ConvRBM for Auditory Filterbank Learning (mentioning)

confidence: 99%

“…The moment parameters of Adam optimization was chosen to be β1=0.5, and β2=0.999. The annealing dropout probability was chosen to be 0.3 based on our earlier experiments in the ASR [20] and environmental sound classification [30]. After the model was trained, the features were extracted from the speech signal as discussed in Section 2.2.…”

Section: Training of ConvRBM and Feature Extraction (mentioning)

confidence: 99%

Auditory Filterbank Learning for Temporal Modulation Features in Replay Spoof Speech Detection

2018


In this paper, we present a standalone replay spoof speech detection (SSD) system to classify natural vs. replay speech. The replay speech spectrum is known to be affected in the higher frequency range. In this context, we propose to exploit auditory filterbank learning using Convolutional Restricted Boltzmann Machine (ConvRBM) with pre-emphasized speech signals. Temporal modulations in amplitude (AM) and frequency (FM) are extracted from the ConvRBM subbands using the Energy Separation Algorithm (ESA). ConvRBM-based short-time AM and FM features are developed using cepstral processing, denoted as AM-ConvRBM-CC and FM-ConvRBM-CC. The proposed temporal modulation features performed better than the baseline Constant-Q Cepstral Coefficients (CQCC) features. On the evaluation set, an absolute reduction of 7.48 % and 5.28 % in Equal Error Rate (EER) is obtained using AM-ConvRBM-CC and FM-ConvRBM-CC, respectively, compared to our CQCC baseline. The best results are achieved by combining scores from AM and FM cues (0.82 % and 8.89 % EER for the development and evaluation sets, respectively). The statistics of the AM-FM features are analyzed to understand the performance gap and the complementary information in the two feature sets.
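The Energy Separation Algorithm mentioned in the abstract above is built on the Teager energy operator (TEO), which also appears in the title of the cited paper. As a minimal sketch, the standard discrete-time TEO is Ψ[x(n)] = x(n)² − x(n−1)·x(n+1). The code below implements only this operator, not the full ESA and not the authors' feature pipeline; the function name `teager_energy` and the test tone parameters are assumptions chosen for illustration.

```python
import math


def teager_energy(x):
    """Discrete Teager energy operator for the interior samples of x.

    psi[n] = x[n]^2 - x[n-1] * x[n+1]; the first and last samples are
    skipped because they lack a neighbour on one side.
    """
    return [x[n] * x[n] - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]


# Hypothetical test tone: amplitude 2.0, digital frequency 0.3 rad/sample.
amp, omega = 2.0, 0.3
tone = [amp * math.cos(omega * n + 0.1) for n in range(50)]
psi = teager_energy(tone)

# For a pure tone the operator is constant: amp^2 * sin(omega)^2.
expected = amp ** 2 * math.sin(omega) ** 2
```

For a sinusoid the output depends on both amplitude and frequency, which is the property that energy-separation methods exploit to recover AM and FM components from a subband signal.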


“…For analysis of the subband filters, we first sort it according to the center frequencies (CFs) of the subband filters as done in [21].…”

Section: Analysis of Filterbank, 3.1 Analysis of Subband Filters (mentioning)

confidence: 99%

“…In this paper, we propose to exploit ConvRBM as a frontend for filterbank learning from the raw audio signals. Compared to our earlier works in [20], [21] and [22], here we have used Adam optimization [23] along with an annealed dropout technique [24]. Invariant representation is learned from the raw audio using ConvRBM and higher-level invariance is achieved using supervised CNN as a classifier.…”

Section: Introduction (mentioning)

confidence: 99%

Unsupervised Filterbank Learning Using Convolutional Restricted Boltzmann Machine for Environmental Sound Classification

2017

Self Cite


In this paper, we propose to use Convolutional Restricted Boltzmann Machine (ConvRBM) to learn a filterbank from raw audio signals. ConvRBM is a generative model trained in an unsupervised way to model audio signals of arbitrary lengths. ConvRBM is trained using the annealed dropout technique, and its parameters are optimized using Adam optimization. The subband filters of ConvRBM learned from the ESC-50 database resemble a Fourier basis in the mid-frequency range, while some of the low-frequency subband filters resemble a Gammatone basis. The auditory-like filterbank scale is nonlinear w.r.t. the center frequencies of the subband filters and follows the standard auditory scales. We have used our proposed model as a front-end for the Environmental Sound Classification (ESC) task with a supervised Convolutional Neural Network (CNN) as a back-end. Using the CNN classifier, the ConvRBM filterbank (ConvRBM-BANK) and its score-level fusion with the Mel filterbank energies (FBEs) gave an absolute improvement of 10.65 % and 18.70 % in classification accuracy, respectively, over FBEs alone on the ESC-50 database. This shows that the proposed ConvRBM filterbank also contains highly complementary information over the Mel filterbank, which is helpful in the ESC task.
