This work tackles the problem of learning a set of language-specific acoustic units from unlabeled speech recordings, given labeled recordings in other languages. Our approach follows a two-step procedure: the model first learns the notion of an acoustic unit from the labeled data, then uses that knowledge to discover new acoustic units in the target language. We implement this process with the Bayesian Subspace Hidden Markov Model (SHMM), a model akin to the Subspace Gaussian Mixture Model (SGMM) in which each low-dimensional embedding represents a full acoustic unit rather than a single HMM state. The subspace is trained on three languages from the GlobalPhone corpus (German, Polish and Spanish), and the acoustic units are discovered on the TIMIT corpus. Results, measured in equivalent Phone Error Rate, show that this approach significantly outperforms previous HMM-based acoustic unit discovery systems and compares favorably with the Variational AutoEncoder-HMM.
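The core subspace idea can be sketched as follows: each acoustic unit is a low-dimensional embedding, and per-state projection matrices learned on the labeled source languages map that embedding to full HMM emission parameters. This is a minimal numpy sketch of the mapping only, with random matrices standing in for the learned subspace; the dimensions and function names are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

D, Q = 39, 10   # feature dim (e.g. MFCC) and subspace dim -- hypothetical sizes
S = 3           # HMM states per acoustic unit

# Per-state subspace bases; in the SHMM these would be learned on the
# labeled source languages, here they are just random placeholders.
W = [rng.normal(size=(D, Q)) for _ in range(S)]
b = [rng.normal(size=D) for _ in range(S)]

def unit_means(h):
    """Map a low-dimensional unit embedding h to the D-dim Gaussian
    means of its S HMM states, SGMM/SHMM-style: mean_s = W_s @ h + b_s."""
    return np.stack([W[s] @ h + b[s] for s in range(S)])

h_new = rng.normal(size=Q)      # embedding of a newly discovered unit
means = unit_means(h_new)       # shape (S, D): one mean per HMM state
```

Discovering a unit in the target language then amounts to inferring only the Q-dimensional embedding `h_new`, while the subspace itself stays fixed.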
Abstract: Recent developments in deep learning have greatly influenced the performance of speech recognition systems. In a Hidden Markov Model-Deep Neural Network (HMM-DNN) speech recognition system, DNNs model senones (context-dependent HMM states), while HMMs capture the temporal relations among senones. Deeper networks have yielded significant performance improvements, so deep learning methods for training very deep architectures have attracted considerable scientific interest. Optimizing a deeper network is a more complex task than optimizing a shallower one, but residual networks have recently exhibited the ability to train very deep architectures without being prone to vanishing/exploding gradient problems. In this work, the effectiveness of residual networks is explored for speech recognition. Along with the depth of the residual network, the criticality of its width is also studied; it is observed that at higher depths, width is also a crucial parameter for attaining significant improvements. A 14-hour subset of the WSJ corpus is used to train the speech recognition systems, and the residual networks converge easily even at depths much greater than that of the baseline deep neural network. Using residual networks, an absolute WER reduction of 0.4 (an 8% relative reduction) is attained over the best-performing deep neural network.
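The mechanism that makes depth tractable is the identity skip connection: each block computes y = x + F(x), so gradients can flow through the additive path unattenuated. This is a minimal numpy sketch of stacking such blocks, with illustrative widths and random weights rather than the paper's acoustic-model architecture.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W1, W2):
    """y = relu(x + F(x)) with F(x) = W2 @ relu(W1 @ x).
    The identity term x is what keeps gradient flow healthy at depth."""
    return relu(x + W2 @ relu(W1 @ x))

rng = np.random.default_rng(0)
width = 64                       # hypothetical layer width
x = rng.normal(size=width)

y = x
for _ in range(20):              # stacking many blocks stays numerically stable
    W1 = rng.normal(size=(width, width)) * 0.01
    W2 = rng.normal(size=(width, width)) * 0.01
    y = residual_block(y, W1, W2)
```

With the small residual branch, the output after 20 blocks remains close to the input signal instead of vanishing or exploding, which is the property the abstract attributes to residual training.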
End-to-end and cascade (ASR-MT) spoken language translation (SLT) systems are reaching comparable performance; however, a large degradation is observed when translating ASR hypotheses compared to using oracle input text. In this work, this degradation is reduced by creating an end-to-end differentiable pipeline between the ASR and MT systems. We train the SLT system with the ASR objective as an auxiliary loss, connecting the two networks through their neural hidden representations. This training provides an end-to-end differentiable path with respect to the final objective function and exploits the ASR objective for better optimization. The architecture improves the BLEU score from 41.21 to 44.69. Ensembling the proposed architecture with independently trained ASR and MT systems further improves the BLEU score from 44.69 to 46.9. All experiments are reported on the English-Portuguese speech translation task using the How2 corpus. The final BLEU score is on par with the best speech translation system on the How2 dataset, without using any additional training data or a language model, and with fewer parameters.
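The auxiliary-loss training described above can be reduced to a one-line objective: the translation loss plus a weighted ASR loss, with both terms differentiable through the shared hidden representations. The sketch below shows only that combination; the weight value and function names are illustrative assumptions, not hyperparameters from the paper.

```python
def joint_loss(mt_loss, asr_loss, aux_weight=0.5):
    """Final training objective: the MT loss plus the ASR loss as an
    auxiliary term. aux_weight is a hypothetical hyperparameter; in a
    real system both losses would be differentiable tensors so that
    gradients flow end-to-end through the ASR encoder."""
    return mt_loss + aux_weight * asr_loss

total = joint_loss(2.0, 1.0)   # 2.0 + 0.5 * 1.0 = 2.5
```

Because the MT decoder consumes the ASR encoder's hidden states rather than a discrete transcript, minimizing this joint objective updates both networks at once, which is what distinguishes the approach from a standard cascade.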