Yannick Estève scite author profile

In this paper, we present TED-LIUM release 3 corpus 3 dedicated to speech recognition in English, which multiplies the available data to train acoustic models in comparison with TED-LIUM 2, by a factor of more than two. We present the recent development on Automatic Speech Recognition (ASR) systems in comparison with the two previous releases of the TED-LIUM Corpus from 2012 and 2014. We demonstrate that, passing from 207 to 452 hours of transcribed speech training data is really more useful for end-to-end ASR systems than for HMM-based state-of-the-art ones. This is the case even if the HMMbased ASR system still outperforms the end-to-end ASR system when the size of audio training data is 452 hours, with a Word Error Rate (WER) of 6.7% and 13.7%, respectively. Finally, we propose two repartitions of the TED-LIUM release 3 corpus: the legacy repartition that is the same as that existing in release 2, and a new repartition, calibrated and designed to make experiments on speaker adaptation. Similar to the two first releases, TED-LIUM 3 corpus will be freely available for the research community.

show abstract

End-To-End Named Entity And Semantic Concept Extraction From Speech

Ghannay

Caubrière

Estève

et al. 2018

View full text Add to dashboard Cite

Named entity recognition (NER) is among SLU tasks that usually extract semantic information from textual documents. Until now, NER from speech is made through a pipeline process that consists in processing first an automatic speech recognition (ASR) on the audio and then processing a NER on the ASR outputs. Such approach has some disadvantages (error propagation, metric to tune ASR systems sub-optimal in regards to the final task, reduced space search at the ASR output level,...) and it is known that more integrated approaches outperform sequential ones, when they can be applied. In this paper, we present a first study of end-to-end approach that directly extracts named entities from speech, though a unique neural architecture. On a such way, a joint optimization is able for both ASR and NER. Experiments are carried on French data easily accessible, composed of data distributed in several evaluation campaign. Experimental results show that this end-to-end approach provides better results (F-measure=0.69 on test data) than a classical pipeline approach to detect named entity categories (F-measure=0.65).

show abstract

Sentiment Analysis of Tunisian Dialects: Linguistic Ressources and Experiments

Medhaffar¹,

Bougares²,

Estève³

et al. 2017

117

View full text Add to dashboard Cite

Dialectal Arabic (DA) is significantly different from the Arabic language taught in schools and used in written communication and formal speech (broadcast news, religion, politics, etc.). There are many existing researches in the field of Arabic language Sentiment Analysis (SA); however, they are generally restricted to Modern Standard Arabic (MSA) or some dialects of economic or political interest. In this paper we focus on SA of the Tunisian dialect. We use Machine Learning techniques to determine the polarity of comments written in Tunisian dialect. First, we evaluate the SA systems performances with models trained using freely available MSA and Multi-dialectal data sets. We then collect and annotate a Tunisian dialect corpus of 17.000 comments from Facebook. This corpus shows a significant improvement compared to the best model trained on other Arabic dialects or MSA data. We believe that this first freely available 12 corpus will be valuable to researchers working in the field of Tunisian Sentiment Analysis and similar areas.

show abstract

Curriculum-Based Transfer Learning for an Effective End-to-End Spoken Language Understanding and Domain Portability

Caubrière¹,

Tomashenko²,

Laurent³

et al. 2019

View full text Add to dashboard Cite

We present an end-to-end approach to extract semantic concepts directly from the speech audio signal. To overcome the lack of data available for this spoken language understanding approach, we investigate the use of a transfer learning strategy based on the principles of curriculum learning. This approach allows us to exploit out-of-domain data that can help to prepare a fully neural architecture. Experiments are carried out on the French MEDIA and PORTMEDIA corpora and show that this end-toend SLU approach reaches the best results ever published on this task. We compare our approach to a classical pipeline approach that uses ASR, POS tagging, lemmatizer, chunker... and other NLP tools that aim to enrich ASR outputs that feed an SLU text to concepts system. Last, we explore the promising capacity of our end-to-end SLU approach to address the problem of domain portability.

show abstract

ASR Error Management for Improving Spoken Language Understanding

Simonnet¹,

Ghannay²,

Camelin³

et al. 2017

View full text Add to dashboard Cite

This paper addresses the problem of automatic speech recognition (ASR) error detection and their use for improving spoken language understanding (SLU) systems. In this study, the SLU task consists in automatically extracting, from ASR transcriptions, semantic concepts and concept/values pairs in a e.g touristic information system. An approach is proposed for enriching the set of semantic labels with error specific labels and by using a recently proposed neural approach based on word embeddings to compute well calibrated ASR confidence measures. Experimental results are reported showing that it is possible to decrease significantly the Concept/Value Error Rate with a state of the art system, outperforming previously published results performance on the same experimental data. It also shown that combining an SLU approach based on conditional random fields with a neural encoder/decoder attention based architecture, it is possible to effectively identifying confidence islands and uncertain semantic output segments useful for deciding appropriate error handling actions by the dialogue manager strategy.

show abstract

Automatic named identification of speakers using diarization and ASR systems

Jousse¹,

Petitrenaud²,

Meignier³

et al. 2009

View full text Add to dashboard Cite

In this paper, we consider the extraction of speaker identity from audio records of broadcast news without a priori acoustic information about speakers. Using an automatic speech recognition system and an automatic speaker diarization system, we present improvements for a method which allows to extract speaker identities from automatic transcripts and to assign them to speech segments.Experiments are carried out on French broadcast news records from the ESTER 1 evaluation campaign. Experimental results using outputs of automatic speech recognition and automatic diarization are presented.

show abstract

Investigating Adaptation and Transfer Learning for End-to-End Spoken Language Understanding from Speech

Tomashenko¹,

Caubrière²,

Estève³

2019

View full text Add to dashboard Cite

This work investigates speaker adaptation and transfer learning for spoken language understanding (SLU). We focus on the direct extraction of semantic tags from the audio signal using an end-to-end neural network approach. We demonstrate that the learning performance of the target predictive function for the semantic slot filling task can be substantially improved by speaker adaptation and by various knowledge transfer approaches. First, we explore speaker adaptive training (SAT) for end-to-end SLU models and propose to use zero pseudo ivectors for more efficient model initialization and pretraining in SAT. Second, in order to improve the learning convergence for the target semantic slot filling (SF) task, models trained for different tasks, such as automatic speech recognition and named entity extraction are used to initialize neural end-to-end models trained for the target task. In addition, we explore the impact of the knowledge transfer for SLU from a speech recognition task trained in a different language. These approaches allow to develop end-to-end SLU systems in low-resource data scenarios when there is no enough in-domain semantically labeled data, but other resources, such as word transcriptions for the same or another language or named entity annotation, are available.

show abstract

Speaker Diarization: About whom the Speaker is Talking ?

Mauclair¹,

Meignier²,

Estève³

2006

View full text Add to dashboard Cite

The automatic speaker diarization consists in splitting the signal into homogeneous segments and clustering them by speakers. However the speaker segments are specified with anonymous labels. This paper proposed a solution to identify those speakers by extracting their full names pronounced in the show. With a semantic classification tree automatically built on a training corpus, the full names detected in transcription of a segment are associated to this segment or to one of its neighbors. Then, a merging method allows to associate a full name to a speaker cluster instead of a anonymous label provided by the diarization. The experiments are carried out over French broadcast news records from the ESTER 2005 evaluation campaign. About 70% show duration is correctly processed for both development and evaluation corpora. On the evaluation corpus, 18.15% show duration is wrongly named and no decision is taken for 11.91% show duration.

show abstract

scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.

Contact Info

hi@scite.ai

10624 S. Eastern Ave., Ste. A-614

Henderson, NV 89052, USA

Blog Terms and Conditions API Terms Privacy Policy Contact Cookie Preferences Do Not Sell or Share My Personal Information

Made with 💙 for researchers

Part of the Research Solutions Family.

Yannick Estève

TED-LIUM 3: Twice as Much Data and Corpus Repartition for Experiments on Speaker Adaptation

End-To-End Named Entity And Semantic Concept Extraction From Speech

Sentiment Analysis of Tunisian Dialects: Linguistic Ressources and Experiments

Curriculum-Based Transfer Learning for an Effective End-to-End Spoken Language Understanding and Domain Portability

ASR Error Management for Improving Spoken Language Understanding

Automatic named identification of speakers using diarization and ASR systems

Investigating Adaptation and Transfer Learning for End-to-End Spoken Language Understanding from Speech

Speaker Diarization: About whom the Speaker is Talking ?

Contact Info

Product

Resources

About