Sequestrin, a CD36 recognition protein on Plasmodium falciparum malaria-infected erythrocytes identified by anti-idiotype antibodies.

Sequence-to-sequence models have been widely used in end-to-end speech processing, for example, automatic speech recognition (ASR), speech translation (ST), and text-to-speech (TTS). This paper focuses on an emergent sequence-to-sequence model called Transformer, which achieves state-of-the-art performance in neural machine translation and other natural language processing applications. We undertook intensive studies in which we experimentally compared and analyzed Transformer and conventional recurrent neural networks (RNN) in a total of 15 ASR, one multilingual ASR, one ST, and two TTS benchmarks. Our experiments revealed various training tips and significant performance benefits obtained with Transformer for each task including the surprising superiority of Transformer in 13/15 ASR benchmarks in comparison with RNN. We are preparing to release Kaldi-style reproducible recipes using open source and publicly available datasets for all the ASR, ST, and TTS tasks for the community to succeed our exciting outcomes.

ESPnet-ST: All-in-One Speech Translation Toolkit

Inaguma, Kiyono, Duh et al. 2020

We present ESPnet-ST, which is designed for the quick development of speech-to-speech translation systems in a single framework. ESPnet-ST is a new project inside the end-to-end speech processing toolkit, ESPnet, which integrates or newly implements automatic speech recognition, machine translation, and text-to-speech functions for speech translation. We provide all-in-one recipes including data pre-processing, feature extraction, training, and decoding pipelines for a wide range of benchmark datasets. Our reproducible results can match or even outperform the current state-of-the-art performances; these pretrained models are downloadable. The toolkit is publicly available at https://github.com/espnet/espnet.

Minimum Latency Training Strategies for Streaming Sequence-to-Sequence ASR

Inaguma, Gaur et al. 2020

Recent Developments on ESPnet Toolkit Boosted by Conformer

et al. 2021

In this study, we present recent developments on ESPnet: End-to-End Speech Processing toolkit, which mainly involve a recently proposed architecture called Conformer, the Convolution-augmented Transformer. This paper shows the results for a wide range of end-to-end speech processing applications, such as automatic speech recognition (ASR), speech translation (ST), speech separation (SS), and text-to-speech (TTS). Our experiments reveal various training tips and significant performance benefits obtained with the Conformer on different tasks. These results are competitive with, or even outperform, the current state-of-the-art Transformer models. We are preparing to release all-in-one recipes using open source and publicly available corpora for all the above tasks with pre-trained models. Our aim for this work is to contribute to our research community by reducing the burden of preparing state-of-the-art research environments, which usually require substantial resources.

Multilingual End-to-End Speech Translation

Inaguma, Duh, Kawahara et al. 2019

Recent Developments on ESPnet Toolkit Boosted by Conformer

Guo, Boyer, Chang et al. 2020. Preprint.

Transfer Learning of Language-independent End-to-end ASR with Language Model Fusion

Inaguma, Cho, Baskar et al. 2019

View full text Add to dashboard Cite

This work explores better adaptation methods to low-resource languages using an external language model (LM) under the framework of transfer learning. We first build a language-independent ASR system in a unified sequence-to-sequence (S2S) architecture with a shared vocabulary among all languages. During adaptation, we perform LM fusion transfer, where an external LM is integrated into the decoder network of the attention-based S2S model throughout the adaptation stage, to effectively incorporate linguistic context of the target language. We also investigate various seed models for transfer learning. Experimental evaluations using the IARPA BABEL data set show that LM fusion transfer improves performance on all five target languages compared with simple transfer learning when the external text data is available. Our final system drastically reduces the performance gap from the hybrid systems.

Acoustic-to-Word Attention-Based Model Complemented with Character-Level CTC-Based Model

Ueno, Inaguma, Mimura et al. 2018

Hirofumi Inaguma

A Comparative Study on Transformer vs RNN in Speech Applications
