Akihiro Ogawa scite author profile

The state-of-the-art neural network architecture named Transformer has been used successfully for many sequence-to-sequence transformation tasks. The advantage of this architecture is that it has a fast iteration speed in the training stage because there is no sequential operation as with recurrent neural networks (RNN). However, an RNN is still the best option for end-to-end automatic speech recognition (ASR) tasks in terms of overall training speed (i.e., convergence) and word error rate (WER) because of effective joint training and decoding methods. To realize a faster and more accurate ASR system, we combine Transformer and the advances in RNN-based ASR. In our experiments, we found that the training of Transformer is slower than that of RNN as regards the learning curve and integration with the naive language model (LM) is difficult. To address these problems, we integrate connectionist temporal classification (CTC) with Transformer for joint training and decoding. This approach makes training faster than with RNNs and assists LM integration. Our proposed ASR system realizes significant improvements in various ASR tasks. For example, it reduced the WERs from 11.1% to 4.5% on the Wall Street Journal and from 16.1% to 11.6% on the TED-LIUM by introducing CTC and LM integration into the Transformer baseline.


Single Channel Target Speaker Extraction and Recognition with Speaker Beam

Delcroix

et al. 2018


A highly ordered ferrocene system regulated by podand peptide chains

Nomoto

Moriuchi

Yamazaki

et al. 1998

Chem. Commun.

121


Strategies for distant speech recognition in reverberant environments

Delcroix

Yoshioka

Ogawa

et al. 2015

EURASIP J. Adv. Signal Process.


Reverberation and noise are known to severely affect the automatic speech recognition (ASR) performance of speech recorded by distant microphones. Therefore, we must deal with reverberation if we are to realize high-performance hands-free speech recognition. In this paper, we review a recognition system that we developed at our laboratory to deal with reverberant speech. The system consists of a speech enhancement (SE) front-end that employs long-term linear prediction-based dereverberation followed by noise reduction. We combine our SE front-end with an ASR back-end that uses neural networks for acoustic and language modeling. The proposed system achieved top scores on the ASR task of the REVERB challenge. This paper describes the different technologies used in our system and presents detailed experimental results that justify our implementation choices and may provide hints for designing distant ASR systems.



Akihiro Ogawa

The NTT CHiME-3 system: Advances in speech enhancement and recognition for mobile multi-microphone devices

Improving Transformer-Based End-to-End Speech Recognition with Connectionist Temporal Classification and Language Model Integration
