In this paper we describe a method to perform sequence-discriminative training of neural network acoustic models without the need for frame-level cross-entropy pre-training. We use the lattice-free version of the maximum mutual information (MMI) criterion: LF-MMI. To make its computation feasible we use a phone n-gram language model in place of the word language model. To further reduce its space and time complexity we compute the objective function using neural network outputs at one third the standard frame rate. These changes enable us to perform the computation for the forward-backward algorithm on GPUs. Furthermore, the reduced output frame rate also provides a significant speed-up during decoding. We present results on 5 different LVCSR tasks with training data ranging from 100 to 2100 hours. Models trained with LF-MMI provide a relative word error rate reduction of ∼11.5% over those trained with the cross-entropy objective function, and ∼8% over those trained with the cross-entropy and sMBR objective functions. A further relative reduction of ∼2.5% can be obtained by fine-tuning these models with the word-lattice-based sMBR objective function.
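For reference, the MMI criterion summarized above can be sketched in standard notation (this formula is not quoted from the paper itself): for utterances $u$ with acoustics $\mathbf{O}_u$ and reference word sequence $W_u$,

$$
\mathcal{F}_{\mathrm{MMI}}(\lambda) \;=\; \sum_{u} \log \frac{\,p(\mathbf{O}_u \mid S_{W_u}; \lambda)\, P(W_u)\,}{\sum_{W} p(\mathbf{O}_u \mid S_{W}; \lambda)\, P(W)}\,,
$$

where $S_W$ is the HMM state sequence for word sequence $W$. The denominator sums over all competing hypotheses; the "lattice-free" approach makes this sum tractable by replacing the word-level language model $P(W)$ with a phone n-gram model, so the denominator graph is small enough to evaluate exactly with the forward-backward algorithm on a GPU.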
An open-source Mandarin speech corpus called AISHELL-1 is released. It is by far the largest corpus suitable for conducting speech recognition research and building speech recognition systems for Mandarin. The recording procedure, including audio capturing devices and environments, is presented in detail. The preparation of the related resources, including transcriptions and the lexicon, is described. The corpus is released with a Kaldi recipe. Experimental results imply that the quality of the audio recordings and transcriptions is promising.
Current very low bit rate speech coders are, due to complexity limitations, designed to work off-line. This paper investigates incremental speech coding that operates in real time and incrementally (i.e., encoded speech depends only on already-uttered speech, without the need for future speech information). Since human speech communication is asynchronous (i.e., different information flows are processed simultaneously), we hypothesised that such an incremental speech coder should also operate asynchronously. To accomplish this task, we describe speech coding that reflects human cortical temporal sampling, which packages information into units of different temporal granularity, such as phonemes and syllables, in parallel. More specifically, a phonetic vocoder (cascaded speech recognition and synthesis systems) extended with syllable-based information transmission mechanisms is investigated. Two main aspects are evaluated in this work: synchronous and asynchronous coding. Synchronous coding refers to the case when the phonetic vocoder and the speech generation process depend on the syllable boundaries during encoding and decoding, respectively. Asynchronous coding, on the other hand, refers to the case when the phonetic encoding and speech generation processes are performed independently of the syllable boundaries. Our experiments confirmed that asynchronous incremental speech coding performs better in terms of intelligibility and overall speech quality, mainly due to better alignment of the segmental and prosodic information. The proposed vocoder operates at an uncompressed bit rate of 213 bits/sec and achieves an average communication delay of 243 ms.
This paper presents the "Ethiopian" system for the SLT 2021 Children Speech Recognition Challenge. Various data processing and augmentation techniques are proposed to tackle the children's speech recognition problem, in particular the scarcity of children's speech training data. Detailed experiments are designed and conducted to show the effectiveness of each technique across different speech recognition toolkits and model architectures. Step by step, we explain how we arrive at our final system, which provides state-of-the-art results in the SLT 2021 Children Speech Recognition Challenge, with 21.66% CER on the Track 1 evaluation set (4th place overall) and 16.53% CER on the Track 2 evaluation set (1st place overall). Post-challenge analysis shows that our system actually achieves 18.82% CER on the Track 1 evaluation set, but we submitted the wrong version to the challenge organizers for Track 1.