Document retrieval tolerating character recognition errors—evaluation and application

Marukawa, Katsumi; Hu, Tao; Fujisawa, Hiromichi; Shima, Yoshihiro

doi:10.1016/s0031-3203(96)00155-0

Cited by 22 publications

(10 citation statements)

References 2 publications


“…This improved average precision retrieval effectiveness in all but one case. However, a further study reported in (Marukawa et al, 1997) again showed the ineffectiveness of query expansion for retrieval from corrupted text. In this research 1083 Japanese news articles were searched using 50 test queries.…”

Section: University Of Nevada Las Vegas (mentioning)

confidence: 97%

Examining and improving the effectiveness of relevance feedback for retrieval of scanned text documents

Lam-Adesina

Jones

2006

Information Processing & Management


Important legacy paper documents are digitized and collected in online accessible archives. This enables the preservation, sharing, and significantly the searching of these documents. The text contents of these document images can be transcribed automatically using OCR systems and then stored in an information retrieval system. However, OCR systems make errors in character recognition which have previously been shown to impact on document retrieval behaviour. In particular relevance feedback query-expansion methods, which are often effective for improving electronic text retrieval, are observed to be less reliable for retrieval of scanned document images. Our experimental examination of the effects of character recognition errors on an ad hoc OCR retrieval task demonstrates that, while baseline information retrieval can remain relatively unaffected by transcription errors, relevance feedback via query expansion becomes highly unstable. This paper examines the reason for this behaviour, and introduces novel modifications to standard relevance feedback methods. These methods are shown experimentally to improve the effectiveness of relevance feedback for errorful OCR transcriptions. The new methods combine similar recognised character strings based on term collection frequency and a string edit-distance measure. The techniques are domain independent and make no use of external resources such as dictionaries or training data.
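The merging idea described in this abstract (grouping similar recognised strings using collection frequency and edit distance) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `max_dist` and `min_ratio` thresholds, and the rule "fold a rare term into a much more frequent neighbour" are all assumptions made for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using the standard two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def merge_variants(freqs, max_dist=1, min_ratio=5.0):
    """Map each term to a canonical form (hypothetical merging rule).

    A term is folded into another term when that term is at least
    `min_ratio` times more frequent in the collection and lies within
    `max_dist` edits. Terms with no such neighbour map to themselves.
    """
    # Most frequent first, so any merge target is processed before its variants.
    by_freq = sorted(freqs, key=lambda t: (-freqs[t], t))
    canonical = {}
    for term in by_freq:
        target = term
        for cand in by_freq:
            if freqs[cand] < freqs[term] * min_ratio:
                break  # remaining candidates are not frequent enough
            if cand != term and edit_distance(term, cand) <= max_dist:
                target = canonical.get(cand, cand)
                break
        canonical[term] = target
    return canonical


# Toy collection frequencies with OCR-style variants (made-up numbers).
freqs = {"retrieval": 50, "retrieva1": 2, "retrleval": 3,
         "document": 40, "docurnent": 1}
canonical = merge_variants(freqs)
```

With these toy counts, "retrieva1" and "retrleval" both fold into "retrieval", while "docurnent" is two edits from "document" and is only merged if `max_dist` is raised to 2. Like the approach in the abstract, the sketch uses no dictionary or training data, only the collection's own statistics.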


“…Although there has been some work in trying to compensate for optical character recognition (OCR) errors introduced into automatically scanned text documents (Marukawa et al 1997;Zhai et al 1996), the area of robust methods for dealing with speech recognition errors in the context of spoken document retrieval is still relatively new. There has been some recent work in this area performed independently and in parallel to the work presented in this thesis.…”

Section: Motivation (mentioning)

confidence: 99%

“…For text documents, there has been work in trying to compensate for optical character recognition (OCR) errors introduced into automatically scanned text documents (Marukawa et al 1997;Zhai et al 1996). In (Marukawa et al 1997), two methods are proposed to deal with character recognition errors for Japanese text documents. One method uses a character error confusion matrix to generate "equivalent" query strings to try to match erroneously recognized text.…”

Section: Related Work (mentioning)

confidence: 99%
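The first method mentioned in the statement above, generating "equivalent" query strings from a character confusion matrix, might look roughly like the following sketch. The cited paper works on Japanese text; the Latin-character confusion table here, the function names, and the `limit` cap are hypothetical and chosen only to keep the example readable.

```python
from itertools import product

# Hypothetical confusion table: each character maps to strings that an
# OCR engine might plausibly output in its place.
CONFUSIONS = {"l": ["1", "I"], "o": ["0"], "m": ["rn"]}


def equivalent_queries(query, confusions, limit=1000):
    """Expand a query into variants by substituting confusable characters.

    The original query is always the first variant. `limit` caps the
    number of variants, since the expansion grows multiplicatively.
    """
    options = [[ch] + confusions.get(ch, []) for ch in query]
    variants = []
    for combo in product(*options):
        variants.append("".join(combo))
        if len(variants) >= limit:
            break
    return variants


def search(query, documents, confusions):
    """Return indices of documents containing any variant of the query."""
    variants = equivalent_queries(query, confusions)
    return [i for i, text in enumerate(documents)
            if any(v in text for v in variants)]


docs = ["the rnode1 was trained", "a model", "nothing here"]
hits = search("model", docs, CONFUSIONS)
```

Here the misrecognised "rnode1" is found because "m" expands to "rn" and "l" to "1". A real system would presumably weight variants by confusion probability rather than treat them all as equal matches.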

Subword-based approaches for spoken document retrieval

Zue

2000

Speech Communication


This thesis explores approaches to the problem of spoken document retrieval (SDR), which is the task of automatically indexing and then retrieving relevant items from a large collection of recorded speech messages in response to a user specified natural language text query. We investigate the use of subword unit representations for SDR as an alternative to words generated by either keyword spotting or continuous speech recognition. Our investigation is motivated by the observation that word-based retrieval approaches face the problem of either having to know the keywords to search for a priori, or requiring a very large recognition vocabulary in order to cover the contents of growing and diverse message collections. The use of subword units in the recognizer constrains the size of the vocabulary needed to cover the language; and the use of subword units as indexing terms allows for the detection of new user-specified query terms during retrieval.

Four research issues are addressed. First, what are suitable subword units and how well can they perform? Second, how can these units be reliably extracted from the speech signal? Third, what is the behavior of the subword units when there are speech recognition errors and how well do they perform? And fourth, how can the indexing and retrieval methods be modified to take into account the fact that the speech recognition output will be errorful?

We first explore a range of subword units of varying complexity derived from error-free phonetic transcriptions and measure their ability to effectively index and retrieve speech messages. We find that many subword units capture enough information to perform effective retrieval and that it is possible to achieve performance comparable to that of text-based word units. Next, we develop a phonetic speech recognizer and process the spoken document collection to generate phonetic transcriptions. We then measure the ability of subword units derived from these transcriptions to perform spoken document retrieval and examine the effects of recognition errors on retrieval performance. Retrieval performance degrades for all subword units (to 60% of the clean reference), but remains reasonable for some subword units even without the use of any error compensation techniques. We then investigate a number of robust methods that take into account the characteristics of the recognition errors and try to compensate for them in an effort to improve spoken document retrieval performance when there are speech recognition errors. We study the methods individually and explore the effects of combining them. Using these robust methods improves retrieval performance by 23%. We also propose a novel approach to SDR where the speech recognition and information retrieval components are more tightly integrated. This is accomplished by developing new recognizer and retrieval models where the interface between the two components is better matched and the goals of the two components are consistent with each other and with the overall goal of the combine...


“…Specifically there have been a number of systems for problems similar to the one we discuss here although using different approaches [35,36]. One that deals with mathematical expressions in a scientific document has recently been described in an overall document processing system [35].…”

Section: Semi-structured Documents and Error Correction (mentioning)

confidence: 99%

“…One characterization uses a confusion matrix (as in speech recognition) to generate "equivalent" query strings that should match erroneously recognized text. The other one searches "non-deterministic text" that contains multiple candidates for ambiguous recognition results [36]. Another approach uses an approximate tree matching method to identify similarities between the documents' structured parts and samples to perform information extraction.…”

Section: Semi-structured Documents and Error Correction (mentioning)

confidence: 99%
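The second method mentioned in the statement above, searching "non-deterministic text" that keeps several candidates for each ambiguous recognition result, can be sketched as follows. The representation (one set of candidate characters per position) and the function names are assumptions for illustration, not details taken from the cited paper.

```python
def matches_at(query, candidates, start):
    """True if every query character is among the candidates at its position."""
    if start + len(query) > len(candidates):
        return False
    return all(ch in candidates[start + k] for k, ch in enumerate(query))


def find(query, candidates):
    """Return every start position where the query matches the candidate text."""
    last_start = len(candidates) - len(query)
    return [s for s in range(last_start + 1) if matches_at(query, candidates, s)]


# Hypothetical recogniser output: positions 1 and 2 are ambiguous,
# so the recogniser kept two candidate characters for each.
recognised = [{"c"}, {"a", "o"}, {"t", "f"}, {"s"}]
positions = find("cat", recognised)
```

Both "cat" and "cot" match at position 0 because the second position keeps both "a" and "o". Recall improves at the cost of possible false matches, which is the trade-off any multi-candidate index has to manage.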

Processing noisy structured textual data using a fuzzy matching approach: application to postal address errors

2000


A multiparadigm approach is developed and demonstrated for exploiting knowledge about structure for the purpose of extracting information from noisy textual data. A motivating example of a potential application would be an address encoding system for a delivery service such as UPS, Federal Express or the United States Post Office. This approach combines aspects of database organization and clustering of records, fuzzy parsing, fuzzy retrieval, an aggregation algebra, and measures of both performance and accuracy. Fuzzy retrieval, in the form of set and fuzzy operators, is accomplished by considering each symbol of the input text to be imperfect and retrieving non-exact matching records from the database that hold for a particular threshold value. The set of low-level database operators constrains the cardinality and accuracy of retrievals. A hierarchical method of clustering the database is defined, whereby the records are partitioned in a manner such that similar records are in the same cluster. This clustering strategy is guaranteed to be mutually exclusive and a complete cover of the data records. Associated with these clusters is an algebra that combines clusters of data into one window of ranked data. A set of fuzzy measures is defined that are used to aggregate and rank sets of records.

