Proceedings of the Sixth Conference on Applied Natural Language Processing - 2000
DOI: 10.3115/974147.974191
Named entity extraction from noisy input

Abstract: In this paper, we analyze the performance of name finding in the context of a variety of automatic speech recognition (ASR) systems and in the context of one optical character recognition (OCR) system. We explore the effects of word error rate from ASR and OCR, performance as a function of the amount of training data, and, for speech, the effect of out-of-vocabulary errors and the loss of punctuation and mixed case.
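The word error rate (WER) that the abstract varies is conventionally defined as the word-level edit distance between the reference transcript and the ASR/OCR output, divided by the reference length. As an illustrative sketch (not code from the paper), it can be computed with a standard Levenshtein dynamic program over words:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count,
    via word-level Levenshtein distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # delete all remaining reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insert all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)
```

For example, one substitution plus one insertion against a four-word reference gives a WER of 0.5; production scoring tools additionally normalize case and punctuation before alignment.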

Cited by 73 publications (59 citation statements)
References 4 publications
“…He applied a NER system on transcriptions of broadcast news, and reported that its performance degraded linearly with the word error rate of speech recognition (e.g., missing data, misspelled data and spuriously tagged names). Named entity recognition in speech data has been investigated further, but this related work has focused on either decreasing the error rate when transcribing speech [15,20], on considering different speech transcription hypotheses [11,3], or on the issue of temporal mismatch between training and test data [8]. None of these articles consider exploiting external text sources to improve NER in speech data nor the problem of recovering missing named entities in transcribed speech.…”
Section: Prior Work
confidence: 99%
“…For instance, the Stanford NER system in the CoNLL 2003 shared task on NER in written data reports an F1 value of 87.94% [23]. [13,15] report a degradation of NER performance of 20-25% in F1 value when applying a NER system trained on written data to transcribed speech.…”
Section: Introduction
confidence: 99%
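The F1 values quoted above are entity-level scores: the harmonic mean of precision and recall over predicted entity mentions. As a minimal sketch (assuming entities are compared as exact (span, type) pairs, which is the common CoNLL-style convention, not a detail taken from this paper):

```python
def ner_f1(gold, predicted):
    """Entity-level precision, recall, and F1 over collections of
    (span, type) tuples; an entity counts as correct only on exact match."""
    gold_set, pred_set = set(gold), set(predicted)
    tp = len(gold_set & pred_set)  # exactly matched entities
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

Under this scoring, a misrecognized word inside a name costs both a false negative (the gold entity is missed) and a false positive (a wrong entity is emitted), which is why ASR and OCR errors depress F1 so sharply.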
“…The extraction of named entities from speech has been used with large vocabulary ASR, most notably Broadcast News, associated with the DARPA HUB-4 task [4,39,27], as well as with similar corpora in Chinese [64] or French [20]. Although the speech in these corpora is not, for the most part, spontaneous, the extraction of proper names, locations, and organizations represents a significant advancement in the processing of this type of data.…”
Section: Extracting Meaning From Speech
confidence: 99%
“…There is previous research connecting OCR with information extraction, including [16] and [11], who demonstrate that the quality of information extraction is reduced in the presence of OCR errors. Work involving the extraction of named entities from OCR output includes [12,8].…”
Section: Introduction
confidence: 99%