Time Delay Neural Networks (TDNNs), also known as one-dimensional Convolutional Neural Networks (1-d CNNs), are an efficient and well-performing neural network architecture for speech recognition. We introduce a factored form of TDNNs (TDNN-F) which is structurally the same as a TDNN whose layers have been compressed via SVD, but is trained from a random start with one of the two factors of each matrix constrained to be semi-orthogonal. This gives substantial improvements over TDNNs and performs about as well as TDNN-LSTM hybrids.
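As a minimal numpy sketch of the core constraint: each factored layer's weight is M = A B, with B held near semi-orthogonality (B Bᵀ ≈ I) by a periodic corrective step rather than an exact projection. The dimensions, step size and update frequency below are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np

# Illustrative TDNN-F layer: a 1536x1536 weight compressed through a
# 256-dim linear bottleneck, M = A @ B. Only B is constrained.
d_out, d_in, d_bn = 1536, 1536, 256
A = np.random.randn(d_out, d_bn) * 0.01    # unconstrained factor
B = np.random.randn(d_bn, d_in) * 0.01     # semi-orthogonal factor

def semi_orthogonal_step(M, alpha=0.25):
    """One gradient-like step shrinking ||M @ M.T - I|| (alpha assumed)."""
    P = M @ M.T - np.eye(M.shape[0])
    return M - alpha * (P @ M)

for _ in range(10):   # in training, applied every few minibatches
    B = semi_orthogonal_step(B)

print(np.abs(B @ B.T - np.eye(d_bn)).max())   # deviation from I shrinks
```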
Recently, the Transformer has gained success in the automatic speech recognition (ASR) field. However, it is challenging to deploy a Transformer-based end-to-end (E2E) model for online speech recognition. In this paper, we propose a Transformer-based online CTC/attention E2E ASR architecture, which contains a chunk self-attention encoder (chunk-SAE) and a monotonic truncated attention (MTA) based self-attention decoder (SAD). Firstly, the chunk-SAE splits the speech into isolated chunks. To reduce the computational cost and improve the performance, we propose the state reuse chunk-SAE. Secondly, the MTA based SAD truncates the speech features monotonically and performs attention on the truncated features. To support online recognition, we integrate the state reuse chunk-SAE and the MTA based SAD into the online CTC/attention architecture. We evaluate the proposed online models on the HKUST Mandarin ASR benchmark and achieve a 23.66% character error rate (CER) with a 320 ms latency. Our online model yields as little as 0.19% absolute CER degradation compared with the offline baseline, and achieves significant improvements over our prior work on Long Short-Term Memory (LSTM) based online E2E models.
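A minimal sketch of the chunk-splitting idea behind the chunk-SAE, with toy sizes: frames attend only within their own chunk, and state reuse lets a chunk additionally read the cached states of the previous chunk instead of recomputing them. The chunk size and mask convention here are assumptions for illustration.

```python
import numpy as np

T, chunk = 12, 4                        # 12 frames, chunks of 4 (toy sizes)
mask = np.zeros((T, T), dtype=bool)     # True = attention allowed

# Isolated chunks: each frame attends only within its own chunk.
for start in range(0, T, chunk):
    mask[start:start + chunk, start:start + chunk] = True

# State reuse: a chunk may also read the cached hidden states of the
# previous chunk, extending left context without recomputation.
for start in range(chunk, T, chunk):
    mask[start:start + chunk, start - chunk:start] = True
```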
Long Short-Term Memory networks (LSTMs) are a component of many state-of-the-art DNN-based speech recognition systems. Dropout is a popular method to improve generalization in DNN training. In this paper we describe extensive experiments in which we investigated the best way to combine dropout with LSTMs, specifically projected LSTMs (LSTMPs). We investigated various locations in the LSTM at which to place the dropout (and various combinations of locations), and a variety of dropout schedules. Our optimized recipe gives consistent improvements in WER across a range of datasets, including Switchboard, TED-LIUM and AMI.

Projected LSTMs (LSTMPs) [4] are an important component of our baseline system, and to provide context for our explanation of dropout we repeat the equations for them; here $x_t$ is the input to the layer at time $t$ and $r_{t-1}$ is the projected recurrent state.
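One standard formulation of the LSTMP recurrence (as in [4]; peephole terms omitted here) is:

```latex
\begin{align*}
i_t &= \sigma(W_{ix} x_t + W_{ir} r_{t-1} + b_i) \\
f_t &= \sigma(W_{fx} x_t + W_{fr} r_{t-1} + b_f) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh(W_{cx} x_t + W_{cr} r_{t-1} + b_c) \\
o_t &= \sigma(W_{ox} x_t + W_{or} r_{t-1} + b_o) \\
m_t &= o_t \odot \tanh(c_t) \\
r_t &= W_{rm} m_t
\end{align*}
```

Dropout can then be placed on $x_t$, $m_t$, $r_t$ or inside the gates, which is the design space the experiments above explore.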
The hybrid CTC/attention end-to-end automatic speech recognition (ASR) architecture combines a CTC ASR system and an attention ASR system into a single neural network. Although the hybrid CTC/attention ASR system combines the advantages of both CTC and attention architectures in training and decoding, it remains challenging to use for streaming speech recognition because of its attention mechanism, CTC prefix probability and bidirectional encoder. In this paper, we propose a stable monotonic chunkwise attention (sMoChA) to stream its attention branch and a truncated CTC prefix probability (T-CTC) to stream its CTC branch. On the acoustic model side, we utilize the latency-controlled bidirectional long short-term memory (LC-BLSTM) to stream its encoder. On the joint CTC/attention decoding side, we propose the dynamic waiting joint decoding (DWJD) algorithm to collect the decoding hypotheses from the CTC and attention branches. Through the combination of the above methods, we stream the hybrid CTC/attention ASR system without much word error rate degradation.
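A rough decode-time sketch of the monotonic chunkwise idea behind sMoChA, assuming a sigmoid halting probability and a fixed chunk width: scan encoder states left to right from the previous stopping point, halt at the first frame whose halting probability exceeds 0.5, then attend over a fixed-width chunk ending there. The energies, threshold and fallback below are illustrative stand-ins, not the paper's exact formulation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def smocha_attend(enc, query, prev_stop, chunk=3):
    T = enc.shape[0]
    for t in range(prev_stop, T):                 # scan left to right
        if sigmoid(enc[t] @ query) > 0.5:         # monotonic halting test
            lo = max(0, t - chunk + 1)
            w = np.exp(enc[lo:t + 1] @ query)
            w /= w.sum()                          # softmax within the chunk
            return (w[:, None] * enc[lo:t + 1]).sum(0), t
    return enc[-1], T - 1                         # no halt: use last frame

enc, q = np.random.randn(20, 8), np.random.randn(8)
ctx, stop = smocha_attend(enc, q, prev_stop=0)
```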
Recently, there has been increasing progress in end-to-end automatic speech recognition (ASR) architectures, which transcribe speech to text without any pre-trained alignments. One popular end-to-end approach is the hybrid Connectionist Temporal Classification (CTC) and attention (CTC/attention) based ASR architecture, which utilizes the advantages of both CTC and attention. Hybrid CTC/attention ASR systems exhibit performance comparable to that of conventional deep neural network (DNN) / hidden Markov model (HMM) ASR systems. However, how to deploy hybrid CTC/attention systems for online speech recognition is still a non-trivial problem. This paper describes our proposed online hybrid CTC/attention end-to-end ASR architecture, which replaces all the offline components of the conventional CTC/attention ASR architecture with their corresponding streaming components. Firstly, we propose stable monotonic chunk-wise attention (sMoChA) to stream the conventional global attention, and further propose monotonic truncated attention (MTA) to simplify sMoChA and solve the training-and-decoding mismatch problem of sMoChA. Secondly, we propose the truncated CTC (T-CTC) prefix score to stream the CTC prefix score calculation. Thirdly, we design the dynamic waiting joint decoding (DWJD) algorithm to dynamically collect the predictions of CTC and attention in an online manner. Finally, we use latency-controlled bidirectional long short-term memory (LC-BLSTM) to stream the widely-used offline bidirectional encoder network. Experiments on the LibriSpeech English and HKUST Mandarin tasks demonstrate that, compared with the offline CTC/attention model, our proposed online CTC/attention model improves the real-time factor in human-computer interaction services and maintains its performance with moderate degradation. To the best of our knowledge, this is the first work to provide a full-scale online solution for the CTC/attention end-to-end ASR architecture.
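To contrast with the fixed-width chunk in the previous sketch, a hedged sketch of the MTA idea: once the halting frame is chosen, attention covers the entire truncated prefix, so training and decoding see the same attention span. Shapes, energies and the halting test are assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mta_attend(enc, query, prev_stop):
    T = enc.shape[0]
    for t in range(prev_stop, T):
        if sigmoid(enc[t] @ query) > 0.5:        # monotonic halting test
            w = np.exp(enc[:t + 1] @ query)
            w /= w.sum()                         # softmax over the whole prefix
            return (w[:, None] * enc[:t + 1]).sum(0), t
    return enc.mean(0), T - 1                    # no halt: fall back

enc, q = np.random.randn(30, 8), np.random.randn(8)
ctx, stop = mta_attend(enc, q, prev_stop=0)
```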
To predict the probability of roadside accidents on curved highway sections, we chose eight risk factors that may contribute to the probability of roadside accidents, conducted simulation tests, and collected a total of 12,800 data points from the PC-crash software. The chi-squared automatic interaction detection (CHAID) decision tree technique was employed to identify significant risk factors and to explore the influence of different combinations of significant risk factors on roadside accidents according to the generated decision rules, so as to propose specific countermeasures as a reference for the revision of the Design Specification for Highway Alignment (JTG D20-2017) of China. Considering the effects of interactions among different risk factors on roadside accidents, path analysis was applied to investigate the importance of the significant risk factors. The results showed that the significant risk factors were, in decreasing order of importance: vehicle speed, horizontal curve radius, vehicle type, adhesion coefficient, hard shoulder width, and longitudinal slope. The five most important factors were chosen as predictors of the probability of roadside accidents in a Bayesian network analysis to establish the probability prediction model of roadside accidents. Eventually, thresholds of the various factors for roadside accident blackspot identification were given according to the probabilistic prediction results.
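As an illustration of the decision-tree step, a hedged sketch on synthetic stand-in data: the paper uses CHAID, for which scikit-learn's CART tree serves here only as an approximate stand-in; the feature names, synthetic labels and tree settings are assumptions, not the study's data or configuration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 12,800 simulated crash records.
rng = np.random.default_rng(0)
X = rng.random((12800, 6))             # six of the candidate risk factors
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.2, 12800) > 1.0).astype(int)

tree = DecisionTreeClassifier(max_depth=4, min_samples_leaf=100)
tree.fit(X, y)                          # decision rules from tree splits
print(dict(zip(
    ["speed", "radius", "veh_type", "adhesion", "shoulder", "slope"],
    tree.feature_importances_.round(3))))   # rough factor-importance proxy
```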
The aims of this study were to achieve a quantitative assessment of the severity of accidents involving roadside trees on highways and to propose corresponding safety measures to reduce accident losses. This paper used the acceleration severity index (ASI), head injury criterion (HIC) and chest resultant acceleration (CRA) as indicators of occupant injuries, and horizontal radii, vehicle departure speeds, tree diameters and roadside tree spacing as research variables, to carry out offset collision tests between cars, trucks and trees by constructing a vehicle rigid-body system and an occupant multibody system in PC-crash 10.0® simulation software. A total of 2,256 data points were collected. For straight and curved segments of highways, the occupant injury evaluation models of cars were fitted based on the CRA, and occupant injury evaluation models of trucks and cars were fitted based on the ASI. According to the Fisher optimal segmentation method, reasonable classification standards for the severity of accidents involving roadside trees and the corresponding ASI and CRA thresholds were determined, and severity assessment methods for accidents involving roadside trees based on the CRA and ASI were provided. Additionally, a new index by which to evaluate the accuracy of the accident severity classification and the degree of misclassification was built and applied for validity verification of the proposed severity assessment methods. The proportion of trucks was introduced to further improve the ASI evaluation model. For the same simulation conditions, the results show that driver chest injuries are more serious than driver head injuries and that the average ASI of cars is greater than that of trucks. The CRA and ASI have a positive linear correlation with the departure speed and a logarithmic correlation with the roadside tree diameter. The larger the roadside tree spacing and the smaller the horizontal radius, the smaller the chance that a vehicle will experience a second collision and the lower the risk of occupant injury. In the method validation, the evaluation results from the two proposed severity assessment methods based on the CRA and ASI are consistent, and the degrees of misclassification are 4.65% and 4.26%, respectively, which verifies the accuracy of the methods proposed in this paper and confirms their applicability.
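The Fisher optimal segmentation used to cut the ASI/CRA scales into severity classes is a classic ordered-clustering dynamic program; below is a minimal sketch on synthetic data (not the authors' code), minimizing within-class sum of squares over contiguous segments of the sorted values.

```python
import numpy as np

def fisher_segment(x, k):
    """Split sorted 1-D data into k contiguous classes; return k-1 thresholds."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)

    def cost(i, j):                      # within-class sum of squares of x[i..j]
        seg = x[i:j + 1]
        return ((seg - seg.mean()) ** 2).sum()

    D = np.full((n, k + 1), np.inf)      # D[j, c]: best cost of x[0..j] in c classes
    cut = np.zeros((n, k + 1), dtype=int)
    for j in range(n):
        D[j, 1] = cost(0, j)
    for c in range(2, k + 1):
        for j in range(c - 1, n):
            for m in range(c - 2, j):    # m: end of the first c-1 classes
                v = D[m, c - 1] + cost(m + 1, j)
                if v < D[j, c]:
                    D[j, c], cut[j, c] = v, m + 1

    bounds, j = [], n - 1                # trace back the class start points
    for c in range(k, 1, -1):
        bounds.append(x[cut[j, c]])
        j = cut[j, c] - 1
    return sorted(bounds)

print(fisher_segment(np.random.rand(60), k=3))   # two threshold values
```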
In this paper, we describe our work on accelerating decoding while improving decoding accuracy. Firstly, we propose an architecture, which we call the Projected Gated Recurrent Unit (PGRU), for automatic speech recognition (ASR) tasks, and show that the PGRU consistently outperforms the standard GRU. Secondly, in order to improve the PGRU's generalization, especially for large-scale ASR tasks, the Output-gate PGRU (OPGRU) is proposed. Finally, the time delay neural network (TDNN) and normalization techniques are found to be beneficial to the proposed projection-based GRUs. The final unidirectional TDNN-OPGRU acoustic model achieves a 3.3% / 4.5% relative reduction in word error rate (WER) compared with the bidirectional projected LSTM (BLSTMP) on the Eval2000 / RT03 test sets. Meanwhile, the TDNN-OPGRU acoustic model speeds up decoding by around 2.6 times compared with the BLSTMP.
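One plausible reading of a PGRU cell, by analogy with the LSTMP projection: a standard GRU whose hidden state is passed through a low-rank projection that also feeds the recurrence, shrinking the recurrent matrices and the decoding cost. Gate placement, dimensions and initialization below are assumptions, not the paper's exact formulation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class PGRUCell:
    """Sketch of a projected GRU step; all shapes are illustrative."""
    def __init__(self, d_in, d_hid, d_proj, seed=0):
        rng, s = np.random.default_rng(seed), 0.1
        self.Wz = rng.normal(0, s, (d_hid, d_in + d_proj))   # update gate
        self.Wr = rng.normal(0, s, (d_proj, d_in + d_proj))  # reset gate (on projection)
        self.Wh = rng.normal(0, s, (d_hid, d_in + d_proj))   # candidate state
        self.Wp = rng.normal(0, s, (d_proj, d_hid))          # low-rank projection

    def step(self, x, r_prev, h_prev):
        xr = np.concatenate([x, r_prev])                     # input + projected state
        z = sigmoid(self.Wz @ xr)
        r = sigmoid(self.Wr @ xr)
        h_tilde = np.tanh(self.Wh @ np.concatenate([x, r * r_prev]))
        h = (1.0 - z) * h_prev + z * h_tilde
        return self.Wp @ h, h                                # projected output, state

cell = PGRUCell(d_in=40, d_hid=512, d_proj=128)
r, h = np.zeros(128), np.zeros(512)
for x in np.random.randn(5, 40):                             # five toy frames
    r, h = cell.step(x, r, h)
```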