Dynamic language modeling for a daily broadcast news transcription system

Martins, Ciro; Teixeira, António; Neto, João Paulo

doi:10.1109/asru.2007.4430103

Cited by 23 publications

(12 citation statements)

References 14 publications

“…In our case, we would like to define an automatic and optimized procedure to daily select the system vocabulary from three different corpora: an out-of-domain dataset (WEBNEWS-PT.train), an in-domain dataset (ALERT-SR.train+pilot) and the adaptation dataset daily collected from the Internet (WEBNEWS-PT.11march). For this purpose, in [Martins et al, 2007] we introduced a modified vocabulary selection technique that takes into account the differences in style across the various corpora, especially in case of written versus spoken style.…”

Section: Vocabulary Selection Using Morpho-syntactic Tagging (POS)

mentioning

confidence: 99%

“…In [Martins et al, 2007a] we proposed a daily and unsupervised adaptation approach which dynamically adapts the active vocabulary and language model to the topic of the current news segment using a multi-phase speech recognition process. Based on contemporary texts daily available on the Web, a story-based vocabulary is selected using the morphosyntactic technique described in section 4.4.…”

Section: Multi-phase Adaptation Framework

mentioning

confidence: 99%

Dynamic language modeling for European Portuguese

Martins, Teixeira, Neto. 2010. Computer Speech & Language. Self Cite

“…In this approach, the decoder search space is a large WFST that maps observation distributions to words. The language model (LM) is the one described in [10], with an active lexicon size of 100K words. It is built using a daily and unsupervised adaptation approach which dynamically adapts the active vocabulary and LM to the topic of the current news.…”

Section: TV Broadcast News Transcription System

mentioning

confidence: 99%

Automatic Classification and Transcription of Telephone Speech in Radio Broadcast Data

Abad, Meinedo, Neto. 2008. Lecture Notes in Computer Science

Abstract. Automatic transcription of telephone speech involves additional challenges compared to wideband data processing, mainly due to channel limitations and to particular characteristics of conversational telephone speech. While in TV speech recognition applications, such as automatic transcription of broadcast news, the presence of telephone data is nearly insignificant (less than 1 %), in most radio broadcast stations the presence of telephone speech grows significantly. Thus, transcription of telephone speech data deserves special attention in radio broadcast applications. In this work, we describe our initial efforts to tackle this particular problem. First, a telephone channel classifier is proposed to automatically detect telephone segments. Then, some strategies for increasing robustness of the automatic transcription system are investigated.

“…We have started by calculating the relative frequency value of each word in the three corpora, added these values for equal words, and selected the 100,000 words with the highest value. This extremely simple solution revealed itself effective, but there are other solutions for this problem, like morpho-syntactic analysis [12]. This selection method added 6,549 parliament transcriptions words that weren't in the initial broadcast news vocabulary.…”

Section: Vocabulary and Lexical Model

mentioning

confidence: 99%
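The frequency-based selection described in the statement above (per-corpus relative frequencies, summed for identical words, top-N kept) can be sketched as follows. This is a minimal illustration, not the cited authors' implementation: the function names, the toy token lists, and the alphabetical tie-breaking rule are all assumptions made for the example.

```python
from collections import Counter


def relative_frequencies(tokens):
    """Return each word's count divided by the corpus size (hypothetical helper)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def select_vocabulary(corpora, size):
    """Sum per-corpus relative frequencies and keep the `size` best-scoring words.

    `corpora` is a list of token lists. Ties are broken alphabetically,
    which is an assumption added here to make the result deterministic.
    """
    scores = Counter()
    for tokens in corpora:
        for word, rel_freq in relative_frequencies(tokens).items():
            scores[word] += rel_freq
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:size]]


# Toy stand-ins for the three corpora; the statement keeps 100,000 words.
broadcast_news = "o governo anunciou hoje o novo plano".split()
web_news = "o governo discute o plano".split()
parliament = "o deputado pediu a palavra ao presidente".split()

vocabulary = select_vocabulary([broadcast_news, web_news, parliament], size=5)
print(vocabulary)
```

Because each corpus is normalized by its own size before summing, a small corpus contributes as much total weight as a large one, which is how domain-specific words from a smaller collection can still enter the vocabulary.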

Domain Adaptation of a Broadcast News Transcription System for the Portuguese Parliament

Neves, Martins, Meinedo, et al. 2008. Lecture Notes in Computer Science. Self Cite

Abstract. The main goal of this work is the adaptation of a broadcast news transcription system to a new domain, namely, the Portuguese Parliament plenary meetings. This paper describes the different domain adaptation steps that lowered our baseline absolute word error rate from 20.1% to 16.1%. These steps include the vocabulary selection, in order to include specific domain terms, language model adaptation, by interpolation of several different models, and acoustic model adaptation, using an unsupervised confidence based approach.
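For scale, the two word error rates quoted in the abstract can be turned into absolute and relative reductions. The short calculation below uses only those two figures; the function name is hypothetical and the relative figure is derived here, not reported in the abstract.

```python
def wer_reduction(baseline, adapted):
    """Return (absolute reduction in points, relative reduction as a fraction)."""
    absolute = baseline - adapted
    relative = absolute / baseline
    return absolute, relative


# WER values taken from the abstract: 20.1% baseline, 16.1% after adaptation.
absolute, relative = wer_reduction(20.1, 16.1)
print(f"absolute: {absolute:.1f} points, relative: {relative:.1%}")
# prints "absolute: 4.0 points, relative: 19.9%"
```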
