Interspeech 2019
DOI: 10.21437/interspeech.2019-2904
The Zero Resource Speech Challenge 2019: TTS Without T

Cited by 107 publications (122 citation statements). References 0 publications.
“…languages [5,6] or pretraining using unsupervised objectives [7,8]. At the extreme of this continuum, zero resource ASR discovers its own units from raw speech [9,10,11]. Despite many interesting results, the field lacks a common benchmark (datasets, evaluations, or baselines) for comparing ideas and results across these settings.…”
Section: Introduction (mentioning, confidence: 99%)
“…Features should ideally disregard irrelevant information (such as speaker and gender), while capturing linguistically meaningful contrasts (such as phone or word categories). Several different unsupervised frame-level acoustic feature learning methods have been developed over the last few years [6]- [12], with neural networks being used in a number of studies [13]- [17].…”
Section: Introduction (mentioning, confidence: 99%)
“…Recent work has considered unsupervised learning for a variety of speech tasks. Some of this work is explicitly aimed at a "zero-speech" setting where no or almost no labeled data is available at all (e.g., [14,15,16,17]), where the focus is to learn phonetic or word-like units, or representations that can distinguish among such units. Other work considers a variety of downstream supervised tasks, and some focuses explicitly on learning representations that generalize across tasks or across very different domains [6,7,18,19].…”
Section: Related Work (mentioning, confidence: 99%)