Interspeech 2018
DOI: 10.21437/interspeech.2018-1836

Automatic Speech Recognition and Topic Identification from Speech for Almost-Zero-Resource Languages

Abstract: Automatic speech recognition (ASR) systems often need to be developed for extremely low-resource languages to serve end uses such as audio content categorization and search. While universal phone recognition is natural to consider when no transcribed speech is available to train an ASR system in a language, adapting universal phone models using very small amounts (minutes rather than hours) of transcribed speech also needs to be studied, particularly with state-of-the-art DNN-based acoustic models. The DARPA LO…
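The abstract describes adapting a multilingual "universal phone" acoustic model with only minutes of transcribed target-language speech. The paper's actual recipe is not reproduced on this page; the following is a minimal, hypothetical sketch of that style of adaptation in PyTorch, where a (pretend-pretrained) shared encoder is frozen and only the phone output layer is fine-tuned with CTC loss. The architecture, feature and phone dimensions, and the stand-in tensors are all assumptions made for illustration.

```python
# Minimal sketch (not the paper's recipe): adapt a "universal phone" acoustic
# model to a new language with minutes of transcribed speech by freezing the
# shared encoder and fine-tuning only the output layer with CTC loss.
# All sizes and the toy data below are illustrative assumptions.
import torch
import torch.nn as nn

N_FEATS, N_PHONES = 40, 100          # assumed: 40-dim features, 100 universal phones

class PhoneRecognizer(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.LSTM(N_FEATS, 256, num_layers=3, batch_first=True)
        self.output = nn.Linear(256, N_PHONES + 1)      # +1 for the CTC blank

    def forward(self, feats):
        hidden, _ = self.encoder(feats)
        return self.output(hidden).log_softmax(dim=-1)

model = PhoneRecognizer()             # stand-in for a multilingually pretrained model
for p in model.encoder.parameters():  # freeze the shared encoder
    p.requires_grad = False

ctc = nn.CTCLoss(blank=N_PHONES)      # blank is the last output index
opt = torch.optim.Adam(model.output.parameters(), lr=1e-4)

# Stand-in for a few minutes of adaptation data: (features, phone labels).
feats = torch.randn(4, 300, N_FEATS)                    # 4 utterances, 300 frames each
labels = torch.randint(0, N_PHONES, (4, 50))            # 50 phone labels per utterance
feat_lens = torch.full((4,), 300, dtype=torch.long)
label_lens = torch.full((4,), 50, dtype=torch.long)

for step in range(100):               # a few passes over the tiny adaptation set
    log_probs = model(feats).transpose(0, 1)            # CTC expects (T, N, C)
    loss = ctc(log_probs, labels, feat_lens, label_lens)
    opt.zero_grad(); loss.backward(); opt.step()
```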

Cited by 8 publications (12 citation statements)
References 22 publications
“…Appropriate pretraining of the encoder and decoder reduced the WER by 20% absolute in the 4h Italian set, to 56.2%. This performance has been shown to still be usable for some downstream tasks such as topic identification in low-resource settings [29].…”
Section: Results (mentioning, confidence: 99%)
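The WER figures quoted in this citation (a 20% absolute reduction, down to 56.2%) follow the standard definition of word error rate: word-level Levenshtein distance divided by the reference length. A small self-contained sketch of that computation, with invented example sentences:

```python
# Word error rate (WER) = (substitutions + deletions + insertions) / reference length,
# computed with standard Levenshtein distance over words.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("send water to the shelter", "send what are the shelter"))  # 0.4
```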
“…To solve the LORELEI task, prior work [8] used a mismatched ASR to directly decode IL speech, while [9] proposed sharing common phonemic representation among languages and transferring acoustic models trained on higher-resource (potentially related) language(s). After ASR, [8,9] translated both development (dev) and incident languages into English words, used the translated dev language data along with the given topic label annotations to learn English-language topic models and then classify the translated IL data. Additionally, instead of using ASR to convert speech into sequences of words, [10,11,9] also investigated unsupervised techniques to automatically discover and decode IL speech segments into phone-like units via acoustic unit discovery (AUD), or into wordlike units via unsupervised term discovery (UTD).…”
Section: Evacuation Shelter (mentioning, confidence: 99%)
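This citation describes a cascade: decode incident-language (IL) speech with ASR, translate the output into English words, train topic models on the labeled development-language translations, and classify the translated IL data. The sketch below is not the cited systems' implementation; it only illustrates the final classification step with an assumed TF-IDF plus logistic-regression classifier from scikit-learn, over invented placeholder texts and topic labels.

```python
# Minimal sketch of topic identification over English translations of ASR output.
# The texts and topic labels are invented placeholders, not LORELEI data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

dev_translations = [
    "people need water and food after the flood",
    "families evacuated to the shelter near the river",
    "clinic requests medicine for the injured",
]
dev_topics = ["water supply", "evacuation", "medical assistance"]

il_translations = [
    "many moved to shelter when water rise",   # noisy ASR + MT output
]

topic_clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
topic_clf.fit(dev_translations, dev_topics)
print(topic_clf.predict(il_translations))      # e.g. ['evacuation']
```

A real system would train on far more development-language data and contend with noisy ASR and translation output, which is exactly the robustness question these citation statements discuss.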