2019
DOI: 10.1109/taslp.2018.2871755

Evolution-Strategy-Based Automation of System Development for High-Performance Speech Recognition


Cited by 15 publications (8 citation statements); references 25 publications.
“…Data Types: Evolutionary DNN construction approaches have been applied to various data types, such as images [13], [59], [108], [109], speech [128], [133], [148], and texts [15], [110]. In particular, tremendous research effort has been devoted to solving the image classification problem.…”
Section: A. Applications (mentioning; confidence: 99%)
“…On CIFAR-100 and ImageNet, the DNN models constructed by EAs have achieved competitive performance in comparison to handcrafted DNN models [13], [15]. Besides image classification, the DNN models designed by EAs have demonstrated great successes in object identification [17], [185], speech recognition [128], and emotion recognition [40].…”
Section: A. Applications (mentioning; confidence: 99%)
“…The LAS model is a Seq2Seq model based on the attention mechanism. Its goal is to maximize the conditional probability of the output character sequence given the input speech [28], [29]. The model is trained directly on the input speech feature sequence and its corresponding character sequence.…”
Section: A. Listen, Attend and Spell Model (mentioning; confidence: 99%)
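To make the quoted objective concrete, the following is a minimal sketch of an attention-based listener/speller model trained with teacher forcing; the character-level cross-entropy loss corresponds to maximizing the conditional probability P(y | x) factorized over output characters. This is an illustrative assumption in PyTorch, not the cited paper's implementation; all module names, dimensions, and the 30-symbol vocabulary are hypothetical.

```python
import torch
import torch.nn as nn

class Listener(nn.Module):
    """Encodes the input speech feature sequence into hidden states."""
    def __init__(self, feat_dim=40, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden, num_layers=2,
                           batch_first=True, bidirectional=True)

    def forward(self, feats):                 # feats: (B, T, feat_dim)
        enc, _ = self.rnn(feats)              # enc: (B, T, 2 * hidden)
        return enc

class Speller(nn.Module):
    """Attends over encoder states and predicts one character per step."""
    def __init__(self, vocab=30, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab, hidden)
        # batch_first for MultiheadAttention assumes a recent PyTorch (>= 1.9).
        self.attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        self.rnn = nn.LSTM(2 * hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, enc, prev_chars):       # prev_chars: (B, U)
        q = self.embed(prev_chars)            # (B, U, hidden)
        ctx, _ = self.attn(q, enc, enc)       # per-step attention context
        dec, _ = self.rnn(torch.cat([q, ctx], dim=-1))
        return self.out(dec)                  # character logits: (B, U, vocab)

# Teacher-forced training step: cross-entropy over characters equals the
# negative log conditional likelihood  -sum_t log P(y_t | y_<t, x).
listener, speller = Listener(), Speller()
feats = torch.randn(2, 120, 40)               # toy batch of speech features
prev = torch.randint(0, 30, (2, 15))          # previous characters (shifted targets)
target = torch.randint(0, 30, (2, 15))        # target characters
logits = speller(listener(feats), prev)
loss = nn.functional.cross_entropy(logits.reshape(-1, 30), target.reshape(-1))
```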
“…1) This paper presents the first use of DARTS-based NAS techniques to automatically learn architecture hyperparameters that directly affect the performance and model complexity of state-of-the-art LF-MMI trained TDNN-F acoustic models. In contrast, previous NAS research conducted on similar systems either used a) evolutionary algorithms requiring expert setting of initial genes and a long evaluation time for each individual candidate architecture [35] (up to 4 days even with a manual early-stopping mechanism), while in our NAS approaches the entire architecture search is performed over all possible 7^28 candidate systems and the model training cycle is limited to approximately 6.6 GPU days; or b) an architecture-sampling-based straight-through gradient approach [39] on a TDNN-CTC end-to-end system, producing much higher WERs (12.6% and 23.2%) on the swbd and callhm subsets of the Hub5'00 test set than our NAS auto-configured TDNN-F systems on the same data (6.9% and 13.0%) presented in this paper. 2) To facilitate efficient search over a very large number of TDNN-F systems, this paper presents the first use of a flexible model parameter sharing scheme that is tailor-designed for specific hyper-parameters contained in TDNN-F systems, to the best of our knowledge.…”
Section: Introduction (mentioning; confidence: 99%)
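For context on the DARTS-style search the quote contrasts with the evolutionary approach of [35], here is a minimal sketch of differentiable selection over candidate layer widths, the kind of bottleneck hyper-parameter described for TDNN-F layers: each candidate operation's output is mixed with softmax weights over learnable architecture parameters, which are trained jointly with the model weights and later discretized by keeping the highest-weighted candidate. The candidate widths, class name, and PyTorch setting are illustrative assumptions, not the cited systems' configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedWidthLayer(nn.Module):
    """Softmax-weighted mixture over candidate bottleneck widths (DARTS-style)."""
    def __init__(self, in_dim=256, out_dim=256, widths=(64, 128, 256)):
        super().__init__()
        self.ops = nn.ModuleList(
            nn.Sequential(nn.Linear(in_dim, w), nn.ReLU(), nn.Linear(w, out_dim))
            for w in widths
        )
        # Architecture parameters: learned jointly with the layer weights.
        self.alpha = nn.Parameter(torch.zeros(len(widths)))

    def forward(self, x):
        weights = F.softmax(self.alpha, dim=0)           # selection probabilities
        return sum(w * op(x) for w, op in zip(weights, self.ops))

layer = MixedWidthLayer()
out = layer(torch.randn(8, 256))
# After search, the width with the largest alpha is kept and the rest are
# discarded, yielding a discrete architecture without training and evaluating
# each candidate separately, unlike the evolutionary search described in [35].
```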
“…4) This paper further presents the earliest work on analysing the efficacy of NAS approaches when used to minimize the structural redundancy in TDNN-F systems and reduce their model parameter uncertainty given limited training data. In contrast, only speech recognition accuracy and model size reduction are investigated in previous research [35]-[42], [70]. The rest of this paper is organized as follows.…”
Section: Introduction (mentioning; confidence: 99%)