Interspeech 2017
DOI: 10.21437/interspeech.2017-1386

Montreal Forced Aligner: Trainable Text-Speech Alignment Using Kaldi

Abstract: We present the Montreal Forced Aligner (MFA), a new open-source system for speech-text alignment. MFA is an update to the Prosodylab-Aligner, and maintains its key functionality of trainability on new data, as well as incorporating improved architecture (triphone acoustic models and speaker adaptation) and other features. MFA uses Kaldi instead of HTK, allowing MFA to be distributed as a stand-alone package, and to exploit parallel processing for computationally intensive training and scaling to larger dataset…
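Since the abstract presents MFA as a stand-alone, trainable command-line tool, a minimal sketch of driving an alignment run from Python follows. This is not from the paper: it assumes an "mfa" command from a recent release is on PATH, all file and model names are illustrative, and exact arguments have varied across MFA versions.

```python
# A sketch of invoking MFA from Python via subprocess. Assumes the "mfa"
# command is installed and on PATH; all paths and names are illustrative.
import subprocess

corpus_dir = "corpus/"        # .wav files with matching .lab/.TextGrid transcripts
dictionary = "english.dict"   # pronunciation dictionary (illustrative name)
model = "english"             # pretrained acoustic model (illustrative name)
output_dir = "aligned/"       # word/phone TextGrids are written here

# "mfa align" aligns a corpus against an existing acoustic model; MFA can
# alternatively train models on the corpus itself (its trainability feature).
subprocess.run(
    ["mfa", "align", corpus_dir, dictionary, model, output_dir],
    check=True,
)
```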

Cited by 673 publications (407 citation statements)
References 16 publications
“…To measure the accessibility of phonetic information, we train linear phone classifiers using Mel-features, APC and Mockingjay representations from the LibriSpeech train-clean-360 subset. We obtain force-aligned phoneme sequences with the Montreal Forced Aligner [24], where there are 72 possible phone classes. Testing results on the LibriSpeech test-clean subset are presented in Figure 3.…”
Section: Phoneme Classification (mentioning, confidence: 99%)
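The probing setup quoted above (a linear classifier over frame-level features, with MFA-derived phone labels) can be sketched as follows. This is not the cited paper's code: random arrays stand in for the Mel/APC/Mockingjay features, random integers stand in for the forced-aligned labels, and a scikit-learn multinomial logistic regression stands in for whatever linear classifier the authors used.

```python
# A sketch of a linear phone probe over 72 phone classes; all data here
# are random stand-ins for real frame features and forced-aligned labels.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = rng.normal(size=(5000, 80))      # frames x feature dim (stand-in)
y_train = rng.integers(0, 72, size=5000)   # forced-aligned phone IDs (stand-in)
X_test = rng.normal(size=(1000, 80))
y_test = rng.integers(0, 72, size=1000)

# A multinomial logistic regression is one standard choice of linear classifier.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("frame-level accuracy:", clf.score(X_test, y_test))
```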
“…Additionally, we find that using phonemes as intermediate targets speeds up word-level pre-training [37][38][39]. We use the Montreal Forced Aligner [40] to obtain word- and phoneme-level alignments for LibriSpeech, and we pre-train the model on the entire 960 hours of training data using these alignments. Using force-aligned labels has the additional benefit of enabling pre-training using short, random crops rather than entire utterances, which reduces the computation and memory required to pre-train the model.…”
Section: Which ASR Targets to Use? (mentioning, confidence: 99%)
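The random-crop trick quoted above works because forced alignment yields a label for every frame, so any window of an utterance still carries valid targets. A minimal sketch, with illustrative names and shapes:

```python
# A sketch of cropping with frame-aligned labels; names and shapes are
# illustrative. Because every frame has a target, any window is trainable.
import numpy as np

def random_crop(features: np.ndarray, labels: np.ndarray, crop_len: int = 200):
    """Return a random crop_len-frame window of (frames x dim) features
    together with the matching slice of per-frame labels."""
    n_frames = features.shape[0]
    if n_frames <= crop_len:
        return features, labels
    start = np.random.randint(0, n_frames - crop_len)
    return features[start:start + crop_len], labels[start:start + crop_len]

# Example: a 1000-frame utterance cropped to 200 frames.
feats = np.zeros((1000, 80))
labs = np.zeros(1000, dtype=int)
f_crop, l_crop = random_crop(feats, labs)
assert f_crop.shape == (200, 80) and l_crop.shape == (200,)
```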
“…We force-aligned each spoken caption to its transcription (using the Montreal Forced Aligner [18] and the Maus Forced Aligner [19] for English and Japanese respectively), resulting in alignments at word and phone level. We also tagged each dataset using TreeTagger [20] for English and KyTea [21] for Japanese.…”
Section: English and Japanese Corpora (mentioning, confidence: 99%)
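MFA writes its word- and phone-level alignments as Praat TextGrids. A short sketch of reading them back, assuming the third-party "textgrid" Python package and MFA's usual "words"/"phones" tier names; the file path is illustrative.

```python
# A sketch of reading MFA's output, assuming the third-party "textgrid"
# package (pip install textgrid); the TextGrid path below is illustrative.
import textgrid

tg = textgrid.TextGrid.fromFile("aligned/utt1.TextGrid")
for tier in tg.tiers:
    if tier.name in ("words", "phones"):
        for interval in tier:
            if interval.mark:  # skip empty (silence/padding) intervals
                print(tier.name, interval.minTime, interval.maxTime, interval.mark)
```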