2018 IEEE Spoken Language Technology Workshop (SLT)
DOI: 10.1109/slt.2018.8639648
Rapid Speaker Adaptation of Neural Network Based Filterbank Layer for Automatic Speech Recognition

Cited by 5 publications (7 citation statements) | References 18 publications
“…Sailor and Patil [22] indeed showed that their proposed convolutional restricted Boltzmann machine (RBM) model learns different centre frequencies depending on the task at hand. Our work is perhaps most closely related to Seki et al. [21], who proposed to adapt a filterbank composed of differentiable functions such as Gaussian or Gammatone filters. They demonstrated more than 7% relative reductions in WER when adapting to speakers in a spontaneous Japanese speech transcription task.…”
Section: Introduction
confidence: 92%
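The filterbank adaptation described in this excerpt lends itself to a short sketch. Below is a minimal, hypothetical PyTorch layer (not Seki et al.'s exact formulation; the class name, initialisation, and dimensions are illustrative assumptions) in which each Gaussian filter's centre frequency and width are trainable parameters, so they can be fine-tuned on a speaker's data by back-propagation:

```python
# A minimal sketch of a filterbank layer with learnable Gaussian filters,
# in the spirit of the adaptation scheme described above. All names,
# initial values, and dimensions are illustrative assumptions.
import torch
import torch.nn as nn


class GaussianFilterbank(nn.Module):
    def __init__(self, n_filters=40, n_fft_bins=257):
        super().__init__()
        # Centre frequencies spread over a normalised frequency axis [0, 1];
        # both centres and (log-)widths are trainable adaptation parameters.
        self.centers = nn.Parameter(torch.linspace(0.05, 0.95, n_filters))
        self.log_widths = nn.Parameter(torch.full((n_filters,), -3.0))
        self.register_buffer("freqs", torch.linspace(0.0, 1.0, n_fft_bins))

    def forward(self, power_spec):
        # power_spec: (batch, time, n_fft_bins)
        widths = self.log_widths.exp()                       # (n_filters,)
        diff = self.freqs[None, :] - self.centers[:, None]   # (n_filters, bins)
        filters = torch.exp(-0.5 * (diff / widths[:, None]) ** 2)
        # Filterbank energies with log compression, as in log-mel features.
        return torch.log(power_spec @ filters.T + 1e-6)


fbank = GaussianFilterbank()
feats = fbank(torch.rand(2, 100, 257))   # -> (2, 100, 40)
```

During speaker adaptation, only `centers` and `log_widths` would be updated while the rest of the acoustic model stays fixed, which keeps the number of speaker-specific parameters very small.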
“…However, the filter gains may be suitable targets for adaptation, for which we would like to attribute importance to the output of individual filters with a small number of parameters. This has similarly been done with learnable filterbanks in traditional feature extraction pipelines [21]. We also briefly note that if we were to scale the gain of each filter, then this would correspond to a version of feature-space Maximum Likelihood Linear Regression (fMLLR) [3] with a diagonal matrix and no bias, or similarly to Learning Hidden Unit Contributions (LHUC) [2], which scales the output of each neuron by a scalar r^(i) for filter i:…”
Section: VTLN Typically Uses a Scaling Function That Is Assumed
confidence: 99%
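The per-filter gain scaling described in this excerpt can be sketched directly: each filterbank output is multiplied by a speaker-dependent scalar r^(i). The module name below is hypothetical, and the bounded 2·sigmoid parameterisation is borrowed from the LHUC idea rather than taken from this paper:

```python
# A minimal sketch of LHUC-style per-filter gains: each filter output is
# scaled by a trainable speaker-dependent scalar r^(i). Names and the
# 2*sigmoid reparameterisation are illustrative assumptions.
import torch
import torch.nn as nn


class FilterGains(nn.Module):
    def __init__(self, n_filters=40):
        super().__init__()
        # One trainable scalar per filter; only these are updated at adaptation time.
        self.r = nn.Parameter(torch.zeros(n_filters))

    def forward(self, fbank_feats):
        # fbank_feats: (batch, time, n_filters)
        return 2.0 * torch.sigmoid(self.r) * fbank_feats


gains = FilterGains()
adapted = gains(torch.randn(2, 100, 40))   # same shape, per-filter rescaled
```

Applied in feature space, such a diagonal rescaling with no bias is the special case of fMLLR mentioned in the quotation.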
“…Then feature reduction/selection is one of the methods to solve this issue, and many types of feature reduction are employed in ASR. SVD is a popular and well-known method that has been applied to test recognition performance [15][16][17]. Therefore, SVD was employed to reduce the number of features in this study.…”
Section: Literature Review
confidence: 99%
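As a rough illustration of the SVD-based feature reduction this excerpt refers to (the function name and the choice of k are hypothetical; the cited study's exact pipeline may differ), the feature matrix can be projected onto its top singular directions:

```python
# A minimal sketch of SVD-based feature reduction: project acoustic feature
# vectors onto the top-k right singular vectors of the (centred) feature
# matrix. Names and dimensions are illustrative assumptions.
import numpy as np


def svd_reduce(features, k):
    # features: (n_frames, n_dims) matrix of acoustic feature vectors.
    centered = features - features.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T   # (n_frames, k) reduced features


reduced = svd_reduce(np.random.randn(1000, 120), k=40)
```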
“…The recognition performance of an Automatic Speech Recognition (ASR) system is affected by speaker variations. Speaker adaptation in conventional DNN-HMM based systems was explored in [1,2,3,4,5,6]. i-vectors appended to input features have been shown to improve the model performance.…”
Section: Introduction
confidence: 99%
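Appending an i-vector to the input features, as mentioned in this excerpt, simply means concatenating the same per-speaker vector to every acoustic frame before it enters the network. A minimal sketch (dimensions and names are illustrative assumptions):

```python
# A minimal sketch of i-vector augmentation: tile the speaker's i-vector over
# all frames and concatenate it to the acoustic features. Dimensions are
# illustrative assumptions.
import numpy as np


def append_ivector(frames, ivector):
    # frames: (n_frames, feat_dim); ivector: (ivec_dim,) for the current speaker.
    tiled = np.tile(ivector, (frames.shape[0], 1))
    return np.concatenate([frames, tiled], axis=1)   # (n_frames, feat_dim + ivec_dim)


augmented = append_ivector(np.random.randn(300, 40), np.random.randn(100))
```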