Naoki Makishima scite author profile

In this paper, we propose a new framework called independent deeply learned matrix analysis (IDLMA), which unifies a deep neural network (DNN) and independence-based multichannel audio source separation. IDLMA utilizes both pretrained DNN source models and statistical independence between sources for the separation, where the time-frequency structures of each source are iteratively optimized by a DNN while enhancing the estimation accuracy of the spatial demixing filters. As the source generative model, we introduce a complex heavy-tailed distribution to improve the separation performance. In addition, we address a semi-supervised situation; namely, a solo-recorded audio dataset can be prepared for only one source in the mixture signal. To solve the limited-data problem, we propose an appropriate data augmentation method to adapt the DNN source models to the observed signal, which enables IDLMA to work even in the semi-supervised situation. Experiments are conducted using music signals with a training dataset in both supervised and semi-supervised situations. The results show the validity of the proposed method in terms of the separation accuracy.

show abstract

Large-Context Conversational Representation Learning: Self-Supervised Learning For Conversational Documents

Masumura

Makishima

Ihori

et al. 2021

View full text Add to dashboard Cite

Independent deeply learned matrix analysis with automatic selection of stable microphone-wise update and fast sourcewise update of demixing matrix

et al. 2021

View full text Add to dashboard Cite

Phoneme-to-Grapheme Conversion Based Large-Scale Pre-Training for End-to-End Automatic Speech Recognition

Masumura¹,

Makishima²,

Ihori³

et al. 2020

View full text Add to dashboard Cite

Hierarchical Transformer-Based Large-Context End-To-End ASR with Large-Context Knowledge Distillation

Masumura

Makishima

Ihori

et al. 2021

View full text Add to dashboard Cite

We present a novel large-context end-to-end automatic speech recognition (E2E-ASR) model and its effective training method based on knowledge distillation. Common E2E-ASR models have mainly focused on utterance-level processing in which each utterance is independently transcribed. On the other hand, large-context E2E-ASR models, which take into account long-range sequential contexts beyond utterance boundaries, well handle a sequence of utterances such as discourses and conversations. However, the transformer architecture, which has recently achieved state-of-the-art ASR performance among utterance-level ASR systems, has not yet been introduced into the large-context ASR systems. We can expect that the transformer architecture can be leveraged for effectively capturing not only input speech contexts but also long-range sequential contexts beyond utterance boundaries. Therefore, this paper proposes a hierarchical transformer-based large-context E2E-ASR model that combines the transformer architecture with hierarchical encoder-decoder based large-context modeling. In addition, in order to enable the proposed model to use long-range sequential contexts, we also propose a large-context knowledge distillation that distills the knowledge from a pre-trained large-context language model in the training phase. We evaluate the effectiveness of the proposed model and proposed training method on Japanese discourse ASR tasks.

show abstract

scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.

Contact Info

hi@scite.ai

10624 S. Eastern Ave., Ste. A-614

Henderson, NV 89052, USA

Blog Terms and Conditions API Terms Privacy Policy Contact Cookie Preferences Do Not Sell or Share My Personal Information

Made with 💙 for researchers

Part of the Research Solutions Family.

Naoki Makishima

Independent Deeply Learned Matrix Analysis for Determined Audio Source Separation

Large-Context Conversational Representation Learning: Self-Supervised Learning For Conversational Documents

Independent deeply learned matrix analysis with automatic selection of stable microphone-wise update and fast sourcewise update of demixing matrix

Phoneme-to-Grapheme Conversion Based Large-Scale Pre-Training for End-to-End Automatic Speech Recognition

Hierarchical Transformer-Based Large-Context End-To-End ASR with Large-Context Knowledge Distillation

Contact Info

Product

Resources

About