2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2015.7178964
Librispeech: An ASR corpus based on public domain audio books

Abstract: This paper introduces a new corpus of read English speech, suitable for training and evaluating speech recognition systems. The LibriSpeech corpus is derived from audiobooks that are part of the LibriVox project, and contains 1000 hours of speech sampled at 16 kHz. We have made the corpus freely available for download, along with separately prepared language-model training data and pre-built language models. We show that acoustic models trained on LibriSpeech give lower error rate on the Wall Street Journal (W…

Cited by 4,186 publications (2,723 citation statements) | References 15 publications
“…We used the Kaldi toolkit (Povey et al., 2011) and publicly available acoustic models trained on the LibriSpeech corpus (Panayotov et al., 2015). The forced alignment was spot-checked manually for accuracy and found to be very accurate.…”
Section: Calculating Reading Rate
confidence: 99%
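As a concrete illustration of the reading-rate calculation this excerpt refers to, here is a minimal Python sketch, assuming the Kaldi forced alignment has been exported to CTM format (one `utt-id channel start duration word` entry per line); the file name is hypothetical:

```python
# A minimal sketch: words per minute per utterance from a Kaldi CTM file.
# "align.ctm" is a hypothetical path; CTM lines are: utt channel start dur word.
from collections import defaultdict

def reading_rates(ctm_path):
    words = defaultdict(int)    # aligned word count per utterance
    end = defaultdict(float)    # end time of the last aligned word
    with open(ctm_path) as f:
        for line in f:
            utt, _chan, start, dur, _word = line.split()[:5]
            words[utt] += 1
            end[utt] = max(end[utt], float(start) + float(dur))
    # Reading rate in words per minute over each utterance's aligned span.
    return {utt: words[utt] / (end[utt] / 60.0) for utt in words}

print(reading_rates("align.ctm"))
```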
“…In order to verify our design, we used a set of models generated by the standard Kaldi model-generating recipe for the LibriSpeech acoustic corpus (Panayotov et al., 2015). Specifically, we used the Deep Neural Network-Weighted Finite State Transducer (DNN-WFST) hybrid with i-vector acoustic adaptation.…”
Section: Methods
confidence: 99%
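For context, here is a minimal sketch of how such a DNN-WFST decode with i-vector adaptation is typically driven from a standard Kaldi LibriSpeech recipe checkout (egs/librispeech/s5); the model, extractor, and graph directory names are assumptions, not the cited paper's setup:

```python
# A minimal sketch driving the standard Kaldi nnet3 decode scripts.
# Directory names (exp/nnet3/extractor, exp/chain/tdnn_sp, graph_tgsmall)
# are illustrative placeholders for a trained recipe's outputs.
import subprocess

def run(cmd):
    """Run a Kaldi shell script from the recipe directory, failing loudly."""
    subprocess.run(cmd, shell=True, check=True)

# 1. Extract online i-vectors for speaker adaptation of the DNN acoustic model.
run("steps/online/nnet3/extract_ivectors_online.sh --nj 8 "
    "data/test_clean exp/nnet3/extractor exp/nnet3/ivectors_test_clean")

# 2. Hybrid DNN-WFST decoding: the nnet3 model scores frames while the
#    WFST decoding graph (HCLG) constrains the search.
run("steps/nnet3/decode.sh --nj 8 "
    "--online-ivector-dir exp/nnet3/ivectors_test_clean "
    "exp/chain/tdnn_sp/graph_tgsmall data/test_clean "
    "exp/chain/tdnn_sp/decode_test_clean_tgsmall")
```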
“…The maximum norm for gradient clipping was set to 5. During model training, we applied dropout (rate 0.5) to the non-recurrent connections of the RNN (Zaremba et al., 2014) and to the hidden layers of the MLPs, and applied L2 regularization (λ = 10⁻⁴) to the parameters of the MLPs. For the evaluation in ASR settings, we used the acoustic model trained on the LibriSpeech dataset (Panayotov et al., 2015) and the language model trained on the ATIS training corpus. A 2-gram language model was used during decoding.…”
confidence: 99%
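A minimal PyTorch sketch of the regularization recipe this excerpt describes (dropout on the non-recurrent RNN connections and MLP hidden layers, L2 on the MLP parameters only, gradient clipping at norm 5); the model shape, optimizer, and dummy data are illustrative, not the cited paper's code:

```python
import torch
import torch.nn as nn

class RNNWithMLP(nn.Module):
    def __init__(self, vocab=1000, hidden=256, n_classes=10):
        super().__init__()
        self.embed = nn.Embedding(vocab, hidden)
        # dropout= on nn.LSTM applies only between stacked layers, i.e. to
        # the non-recurrent connections (Zaremba et al., 2014).
        self.rnn = nn.LSTM(hidden, hidden, num_layers=2,
                           dropout=0.5, batch_first=True)
        # Dropout on the MLP hidden layer.
        self.mlp = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Dropout(0.5),
                                 nn.Linear(hidden, n_classes))

    def forward(self, tokens):
        h, _ = self.rnn(self.embed(tokens))
        return self.mlp(h[:, -1])          # classify from the last time step

model = RNNWithMLP()
# L2 regularization (weight_decay = 1e-4) on the MLP parameters only;
# the RNN and embedding parameter groups get no weight decay.
optimizer = torch.optim.SGD([
    {"params": model.mlp.parameters(), "weight_decay": 1e-4},
    {"params": model.rnn.parameters()},
    {"params": model.embed.parameters()},
], lr=0.1)

loss = model(torch.randint(0, 1000, (4, 12))).sum()   # dummy forward pass
loss.backward()
# Gradient clipping with maximum norm 5, as in the excerpt.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
optimizer.step()
```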
“…Speech utterances for simulating target and interference speech were picked from the Librispeech [32] dataset. They were divided into training, development, and test sets with no overlap.…”
Section: Signal Generation and Feature Extraction
confidence: 99%
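A minimal sketch of simulating a target-plus-interference mixture from two LibriSpeech utterances at a chosen SNR; the file paths, the 5 dB SNR, and the use of the `soundfile` package are assumptions, since the excerpt does not specify the mixing procedure:

```python
# A minimal sketch: mix a target utterance with an interfering one at a
# requested SNR. Paths and SNR below are illustrative placeholders.
import numpy as np
import soundfile as sf  # assumes the pysoundfile package is installed

def mix_at_snr(target, interference, snr_db):
    """Scale the interference so the mixture has the requested SNR (dB)."""
    n = min(len(target), len(interference))
    target, interference = target[:n], interference[:n]
    p_t = np.mean(target ** 2)
    p_i = np.mean(interference ** 2)
    scale = np.sqrt(p_t / (p_i * 10 ** (snr_db / 10)))
    return target + scale * interference

target, sr = sf.read("LibriSpeech/train-clean-100/19/198/19-198-0001.flac")
interf, _ = sf.read("LibriSpeech/train-clean-100/26/495/26-495-0002.flac")
mixture = mix_at_snr(target, interf, snr_db=5.0)
sf.write("mixture.wav", mixture, sr)
```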