The 300k LIMSI German broadcast news transcription system

McTait, Kevin; Adda-Decker, Martine

doi:10.21437/eurospeech.2003-102

Cited by 18 publications

(4 citation statements)

References 6 publications

“…The performance of the ASR is displayed in table 2. Taking the diversity of the test corpus with its high amount of spontaneous speech into account, the word error rate (WER) is comparable to the performance of other systems [10]. The better performance of the sub-word based recognition can be explained by out-of-vocabulary (OOV) effects in the word recognition. For evaluating the NER, precision, recall and the f-measure are defined in the usual way [8].…”

Section: Results (mentioning)

confidence: 99%

“…The large word count is necessary because of the compounding strategy inherent in the German language [10]. Grapheme-to-phoneme conversion for the word pronunciation lexicon was carried out using the transcription module of the Bonn Open Source Synthesis System (BOSSII) developed by the Institut für Kommunikationsforschung und Phonetik of Bonn University [4].…”

Section: Speech Recognition System (mentioning)

confidence: 99%

Named Entity Recognition of Spoken Documents Using Subword Units

Paaß, Pilz, Schwenninger

2009

2009 IEEE International Conference on Semantic Computing

The output of a speech recognition system is a stream of text features that is overlaid by noise resulting from errors in the system's statistical classification of the audio input. Conditional Random Fields (CRFs), which have already proven themselves to be efficient, high-performance Named Entity Recognizers (NERs) for named entities from text, offer the promise to compensate part of these errors. In this paper we use CRFs to extract named entities from spoken audio documents. We consider a real-world audio information extraction scenario under which CRFs are trained to recognize named entities in unedited radio audio documents that have been converted into a stream of text features by a speech recognition system. The automatic speech recognition system (ASR) is able to produce word transcriptions as well as syllables. It uses general speaker-independent acoustic models and a domain-independent statistical language model, ensuring that recognizer performance is not specific to the experimental domain. Using an additional syllable model increases the generality of the spoken document classification system, giving it the flexibility to handle words that are not present in the vocabulary. In this paper we apply CRFs for the first time to different features produced by German ASR. The experiments confirm that using transcribed syllables together with words can compensate for part of the NER errors caused by ASR transcription.

“…A more detailed description of the baseline system can be found in [1]. The baseline ASR system yields a performance comparable to other state of the art systems for German such as [2] (see section 4), yet the word error rate was still too high for displaying the corresponding transcripts to the end users of the ARD Mediathek. We apply a twofold adaptation strategy in order to reduce the mismatch between our baseline ASR model and the heterogeneous ARD data: acoustic and language model adaptation.…”

Section: Automatic Speech Recognition (mentioning)

confidence: 99%

Social recommendation using speech recognition: Sharing TV scenes in social networks

Schneider, Tschöpel, Schwenninger

2012

2012 13th International Workshop on Image Analysis for Multimedia Interactive Services

We describe a novel system which simplifies recommendation of video scenes in social networks, thereby attracting a new audience for existing video portals. Users can select interesting quotes from a speech recognition transcript, and share the corresponding video scene with their social circle with minimal effort. The system has been designed in close cooperation with the largest German public broadcaster (ARD), and was deployed at the broadcaster's public video portal. A twofold adaptation strategy adapts our speech recognition system to the given use case. First, a database of speaker-adapted acoustic models for the most important speakers in the corpus is created. We use spectral speaker identification for detecting whether one of these speakers is speaking, and select the corresponding model accordingly. Second, we apply language model adaptation by exploiting prior knowledge about the video category.

“…The German BN transcription system (DE_e) in [McTait & Adda-Decker, 2003] (e) enhances the system in [Lamel & Gauvain, 2002] (d) essentially by incorporating new language data for the estimation of AM and LM, and by reducing the effects of intense compounding of German words on the lexical coverage by scaling up its size from 65k to 300k words. Noteworthily, decomposing compounds at morpheme boundaries would allow the constitution of a virtually infinite vocabulary and thereby would maintain a relatively small effective vocabulary.…”

Section: Recognition Task Complexity (mentioning)

confidence: 99%

Large vocabulary continuous speech recognition for the transcription of Catalan broadcast news and conversations : towards analysis and modelling of acoustic reduction in spontaneous speech

Schulz

The transcription of spontaneous speech still poses a challenge to state-of-the-art methods for automatic speech recognition. The present thesis describes the comprehensive development of a large vocabulary continuous speech recognition system for the transcription of Catalan broadcast news and conversations and evolves towards novel approaches for analysis and modelling of acoustic reduction in spontaneous speech. It initially focuses on various conventional methods for acoustic analysis, acoustic and language modelling and hypothesis search. Improvements over the original single-pass baseline system are mainly attained by domain and speaking style emphasising interpolation of individually estimated language models, linear discriminating projection of acoustic observations that improves the phonetic class separability, speaker normalisation of the acoustic observations, speaker adaptive training and acoustic model adaptation in a multi-pass system approach. The analysis of acoustic reduction initially focuses on context-independent vowel and consonant specific spectral and temporal properties whose parameters display statistically significant differences between the phoneme prototypes in spontaneous speech and their canonical realisations in planned speech. The introduction of the feature space analysis provides the general means to reveal these differences in conventional acoustic observations for automatic speech recognition. It displays statistically significant differences context-independently but also in a syllable context between adjacent phonemes, suggesting particular reduction patterns. The analysis furthermore challenges the often suggested coherence between the co-occurring reduction of spectral and temporal properties. The modelling of acoustic reduction first focuses on segment conditioned discriminating variables and variability class dependent models and variability class specific adaptation of the original acoustic model. It introduces, as discriminating variables, phoneme rate as a means to analyse temporal properties and feature space reduction ratio as a means to analyse the reduction of spectral properties in the conventional feature space for large vocabulary continuous speech recognition. These variables are clustered and determine the classes for segment conditioned variability class dependent models and their scoring during the hypothesis search in recognition. Neither approach displays a significant performance improvement. Furthermore, the modelling advances towards segment constituent predictability dependent models that introduce predictability as a discriminating variable for variability class dependent models, relying on the fundamental coherence between predictability and acoustic reduction that is suggested through the principle of least effort and the redundancy theory. It thereby focuses on word and phoneme predictability. This approach displays no significant performance improvement. Planned speech apparently antagonises the principle of least effort. Thus, a prior segment conditioned analysis of acoustic reduction may indicate its average degree of reduction, while the within-segment variation may indicate whether a segment exhibits sufficient relaxation of the speaking style to adopt the principle of least effort. Thus, segments exhibiting small within-segment variation may be modelled separately from those of large within-segment variation, whereas modelling the latter by word, syllable or phoneme predictability dependent models may provide a research perspective.
