Composite embedding systems for ZeroSpeech2017 Track1

Shibata, Hayato; Kato, Taku; Shinozaki, Takahiro; Watanabe, Shinji

doi:10.1109/asru.2017.8269012

Cited by 15 publications

(49 citation statements)

References 14 publications

“…A DNN was trained using these labels to generate BNF or posteriorgram representation. In [5], [14], language-mismatched ASR systems were utilized to decode the target speech, and frame labels were generated from the ASR decoding lattices. In [30], BNF representation was generated by applying multi-task learning with both in-domain and out-of-domain data [25].…”

Section: Related Work a Deep Learning Approaches To Unsupervise (mentioning)

confidence: 99%

“…The frame labels for out-of-domain data were obtained by HMM forced alignment, while the labels for in-domain data were from DPGMM clustering [12]. In [5], [14], [31], a DNN AM was trained with transcribed data of an out-of-domain language, and used to extract BNFs or posteriorgrams from target speech.…”

Section: Related Work a Deep Learning Approaches To Unsupervise (mentioning)

confidence: 99%

“…The proposed system design emphasizes on leveraging speech data resources from out-of-domain languages [5], [14]. This is realized in the following aspects:…”

Section: Proposed System (mentioning)

confidence: 99%

“…Typically a well-trained DNN-based AM requires hundreds to thousands of hours of transcribed speech. As a matter of fact, high-performance ASR systems are available only for major languages [5]. Even for resource-rich languages, preparing transcriptions for available training data is a time-consuming task that involves considerable human effort.…”

Section: Introduction (mentioning)

confidence: 99%

“…In addition to the DPGMM-HMM labels, a different type of frame labels can be obtained using one or more out-of-domain ASR systems [5], [14]. While the DPGMM-HMM frame labels incorporate statistical information of the acoustic properties of target speech, the ASR senone labels leverage the phonetic information acquired from out-of-domain languages.…”

Section: Introduction (mentioning)

confidence: 99%

Exploiting Cross-Lingual Speaker and Phonetic Diversity for Unsupervised Subword Modeling

Feng, Lee (2019). IEEE/ACM Trans. Audio Speech Lang. Process.

This research addresses the problem of acoustic modeling of low-resource languages for which transcribed training data is absent. The goal is to learn robust frame-level feature representations that can be used to identify and distinguish subword-level speech units. The proposed feature representations comprise various types of multilingual bottleneck features (BNFs) that are obtained via multi-task learning of deep neural networks (MTL-DNN). One of the key problems is how to acquire high-quality frame labels for untranscribed training data to facilitate supervised DNN training. It is shown that learning of robust BNF representations can be achieved by effectively leveraging transcribed speech data and well-trained automatic speech recognition (ASR) systems from one or more out-of-domain (resource-rich) languages. Out-of-domain ASR systems can be applied to perform speaker adaptation with untranscribed training data of the target language, and to decode the training speech into frame-level labels for DNN training. It is also found that better frame labels can be generated by considering temporal dependency in speech when performing frame clustering. The proposed methods of feature learning are evaluated on the standard task of unsupervised subword modeling in Track 1 of the ZeroSpeech 2017 Challenge. The best performance achieved by our system is 9.7% in terms of across-speaker triphone minimal-pair ABX error rate, which is comparable to the best systems reported recently. Lastly, our investigation reveals that the closeness between target languages and out-of-domain languages and the amount of available training data for individual target languages could have significant impact on the goodness of learned features.
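The abstract above reports results as an ABX error rate. As a rough illustration of what that metric measures, here is a minimal sketch, not the challenge's official evaluation. It assumes each token is already a single fixed-length feature vector compared with Euclidean distance, whereas a real evaluation would compare variable-length frame sequences and handle the within-speaker and across-speaker conditions, both omitted here. The names `euclidean` and `abx_error_rate` are hypothetical.

```python
import math
from itertools import product


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def abx_error_rate(cat_a, cat_b, dist=euclidean):
    """Simplified, one-directional ABX error rate (illustrative assumption).

    cat_a, cat_b: lists of feature vectors, one list per category.
    For every triple (A, B, X), with A and X distinct tokens of cat_a and
    B a token of cat_b, the trial is an error when X is closer to B than
    to A. Ties count as half an error.
    """
    errors = 0.0
    trials = 0
    for (i, a), (j, x) in product(enumerate(cat_a), repeat=2):
        if i == j:
            continue  # A and X must be different tokens of the same category
        for b in cat_b:
            d_ax = dist(a, x)
            d_bx = dist(b, x)
            if d_bx < d_ax:
                errors += 1.0
            elif d_bx == d_ax:
                errors += 0.5
            trials += 1
    if trials == 0:
        raise ValueError("need at least two tokens in cat_a and one in cat_b")
    return errors / trials


if __name__ == "__main__":
    # Toy data: two tight tokens of one category, one distant token of another.
    same = [(0.0, 0.0), (0.1, 0.0)]
    other = [(5.0, 5.0)]
    print(abx_error_rate(same, other))
```

A lower value means the representation separates the two categories better; a value near 0.5 would indicate chance-level discrimination in this simplified setup.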

Cross-Lingual Self-training to Learn Multilingual Representation for Low-Resource Speech Recognition

Zhang, Song, et al. (2022). Circuits Syst Signal Process

Multilingual and unsupervised subword modeling for zero-resource languages

Hermann, Kamper, Goldwater (2021). Computer Speech & Language

Unsupervised subword modeling aims to learn low-level representations of speech audio in "zero-resource" settings: that is, without using transcriptions or other resources from the target language (such as text corpora or pronunciation dictionaries). A good representation should capture phonetic content and abstract away from other types of variability, such as speaker differences and channel noise. Previous work in this area has primarily focused on learning from target language data only, and has been evaluated only intrinsically. Here we directly compare multiple methods, including some that use only target language speech data and some that use transcribed speech from other (non-target) languages, and we evaluate using two intrinsic measures as well as on a downstream unsupervised word segmentation and clustering task. We find that combining two existing target-language-only methods yields better features than either method alone. Nevertheless, even better results are obtained by extracting target language bottleneck features using a model trained on other languages. Cross-lingual training using just one other language is enough to provide this benefit, but multilingual training helps even more. In addition to these results, which hold across both intrinsic measures and the extrinsic task, we discuss the qualitative differences between the different types of learned features.
