Interspeech 2017
DOI: 10.21437/interspeech.2017-1386

Montreal Forced Aligner: Trainable Text-Speech Alignment Using Kaldi

Abstract: We present the Montreal Forced Aligner (MFA), a new open-source system for speech-text alignment. MFA is an update to the Prosodylab-Aligner, and maintains its key functionality of trainability on new data, as well as incorporating improved architecture (triphone acoustic models and speaker adaptation) and other features. MFA uses Kaldi instead of HTK, allowing MFA to be distributed as a stand-alone package, and to exploit parallel processing for computationally intensive training and scaling to larger dataset…
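Since the abstract presents MFA as a stand-alone, trainable command-line tool, a minimal sketch of driving an alignment run from Python follows. This is not from the paper: it assumes an "mfa" command from a recent release is on PATH, all file and model names are illustrative, and exact arguments have varied across MFA versions.

```python
# A sketch of invoking MFA from Python via subprocess. Assumes the "mfa"
# command is installed and on PATH; all paths and names are illustrative.
import subprocess

corpus_dir = "corpus/"        # .wav files with matching .lab/.TextGrid transcripts
dictionary = "english.dict"   # pronunciation dictionary (illustrative name)
model = "english"             # pretrained acoustic model (illustrative name)
output_dir = "aligned/"       # word/phone TextGrids are written here

# "mfa align" aligns a corpus against an existing acoustic model; MFA can
# alternatively train models on the corpus itself (its trainability feature).
subprocess.run(
    ["mfa", "align", corpus_dir, dictionary, model, output_dir],
    check=True,
)
```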

Cited by 673 publications (407 citation statements)
References 16 publications
“…To measure the accessibility of phonetic information, we train linear phone classifiers using Mel-features, APC and Mockingjay representations from the LibriSpeech train-clean-360 subset. We obtain force-aligned phoneme sequences with the Montreal Forced Aligner [24], where there are 72 possible phone classes. Testing results on the LibriSpeech test-clean subset are presented in Figure 3.…”
Section: Phoneme Classification (mentioning, confidence: 99%)
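The probing setup quoted above (a linear classifier over frame-level features, with MFA-derived phone labels) can be sketched as follows. This is not the cited paper's code: random arrays stand in for the Mel/APC/Mockingjay features, random integers stand in for the forced-aligned labels, and a scikit-learn multinomial logistic regression stands in for whatever linear classifier the authors used.

```python
# A sketch of a linear phone probe over 72 phone classes; all data here
# are random stand-ins for real frame features and forced-aligned labels.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = rng.normal(size=(5000, 80))      # frames x feature dim (stand-in)
y_train = rng.integers(0, 72, size=5000)   # forced-aligned phone IDs (stand-in)
X_test = rng.normal(size=(1000, 80))
y_test = rng.integers(0, 72, size=1000)

# A multinomial logistic regression is one standard choice of linear classifier.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("frame-level accuracy:", clf.score(X_test, y_test))
```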
“…Additionally, we find that using phonemes as intermediate targets speeds up word-level pre-training [37][38][39]. We use the Montreal Forced Aligner [40] to obtain word- and phoneme-level alignments for LibriSpeech, and we pre-train the model on the entire 960 hours of training data using these alignments. Using force-aligned labels has the additional benefit of enabling pre-training using short, random crops rather than entire utterances, which reduces the computation and memory required to pre-train the model.…”
Section: Which ASR Targets to Use? (mentioning, confidence: 99%)
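The random-crop trick quoted above works because forced alignment yields a label for every frame, so any window of an utterance still carries valid targets. A minimal sketch, with illustrative names and shapes:

```python
# A sketch of cropping with frame-aligned labels; names and shapes are
# illustrative. Because every frame has a target, any window is trainable.
import numpy as np

def random_crop(features: np.ndarray, labels: np.ndarray, crop_len: int = 200):
    """Return a random crop_len-frame window of (frames x dim) features
    together with the matching slice of per-frame labels."""
    n_frames = features.shape[0]
    if n_frames <= crop_len:
        return features, labels
    start = np.random.randint(0, n_frames - crop_len)
    return features[start:start + crop_len], labels[start:start + crop_len]

# Example: a 1000-frame utterance cropped to 200 frames.
feats = np.zeros((1000, 80))
labs = np.zeros(1000, dtype=int)
f_crop, l_crop = random_crop(feats, labs)
assert f_crop.shape == (200, 80) and l_crop.shape == (200,)
```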
“…We force-aligned each spoken caption to its transcription (using the Montreal Forced Aligner [18] and the Maus Forced Aligner [19] for English and Japanese respectively), resulting in alignments at word and phone level. We also tagged each dataset using TreeTagger [20] for English and KyTea [21] for Japanese.…”
Section: English and Japanese Corpora (mentioning, confidence: 99%)
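MFA writes its word- and phone-level alignments as Praat TextGrids. A short sketch of reading them back, assuming the third-party "textgrid" Python package and MFA's usual "words"/"phones" tier names; the file path is illustrative.

```python
# A sketch of reading MFA's output, assuming the third-party "textgrid"
# package (pip install textgrid); the TextGrid path below is illustrative.
import textgrid

tg = textgrid.TextGrid.fromFile("aligned/utt1.TextGrid")
for tier in tg.tiers:
    if tier.name in ("words", "phones"):
        for interval in tier:
            if interval.mark:  # skip empty (silence/padding) intervals
                print(tier.name, interval.minTime, interval.maxTime, interval.mark)
```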