2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
DOI: 10.1109/asru.2017.8268953
The zero resource speech challenge 2017

Abstract: We describe a new challenge aimed at discovering subword and word units from raw speech. This challenge is the follow-up to the Zero Resource Speech Challenge 2015. It aims at constructing systems that generalize across languages and adapt to new speakers. The design features and evaluation metrics of the challenge are presented and the results of seventeen models are discussed.

Index Terms: zero resource speech technology, subword modeling, acoustic unit discovery, unsupervised term discovery

Cited by 152 publications (221 citation statements)
References 30 publications (38 reference statements)
“…For g c , we use five convolutional layers with strides [5, 4, 2, 2, 2], filter-sizes [10,8,4,4,4] and 256 hidden units with ReLU activations. Besides, the features are normalized channelwise between each convolution.…”
Section: S212 Architecture Details
Confidence: 99%
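The excerpt above specifies a five-layer strided convolutional encoder g_c. A minimal sketch (not the authors' code; the 16 kHz sampling rate is an assumption typical of this literature) can compute what those strides and filter sizes imply for the encoder's temporal resolution:

```python
# Sketch: temporal resolution of a stack of strided 1-D convolutions with the
# quoted hyperparameters: strides [5, 4, 2, 2, 2], filter sizes [10, 8, 4, 4, 4].

def encoder_resolution(filter_sizes, strides):
    """Return (receptive_field, hop) in input samples for strided
    1-D convolutions applied in the given order."""
    receptive_field, jump = 1, 1  # jump = cumulative stride so far
    for k, s in zip(filter_sizes, strides):
        receptive_field += (k - 1) * jump  # each layer widens the input window
        jump *= s                          # and coarsens the output time grid
    return receptive_field, jump

rf, hop = encoder_resolution([10, 8, 4, 4, 4], [5, 4, 2, 2, 2])
print(rf, hop)  # 465 samples per window, one feature vector every 160 samples

# Assuming 16 kHz audio, that is a ~29 ms window emitted every 10 ms,
# comparable to standard MFCC framing.
print(rf / 16000 * 1000, hop / 16000 * 1000)
```

So the overall downsampling factor is the product of the strides (5·4·2·2·2 = 160), which is why each 256-dimensional output frame covers one 10 ms step of 16 kHz input.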
“…languages [5,6] or pretraining using unsupervised objectives [7,8]. At the extreme of this continuum, zero resource ASR discovers its own units from raw speech [9,10,11]. Despite many interesting results, the field lacks a common benchmark (datasets, evaluations, or baselines) for comparing ideas and results across these settings.…”
Section: Introduction
Confidence: 99%
“…For many low-resource languages, however, it is difficult or impossible to collect such annotated resources. Motivated by the observation that infants acquire language without hard supervision, studies into "zero-resource" speech technology have started to develop unsupervised systems that can learn directly from unlabelled speech audio [1][2][3].…”
Section: Introduction
Confidence: 99%
“…Even for resource-rich languages, preparing transcriptions for available training data is a time-consuming task that involves considerable human effort. For many languages in the world, very little or no transcribed speech is available [6], and conventional acoustic modeling techniques are simply not applicable. Unsupervised speech modeling is the task of building subword- or word-level AMs when only untranscribed speech is available for training [7]-[9].…”
Section: Introduction
Confidence: 99%