2016 | DOI: 10.2352/issn.2470-1173.2016.17.drr-065

Revisiting Known-Item Retrieval in Degraded Document Collections

Abstract: Optical character recognition software converts an image of text to a text document but typically degrades the document's contents. Correcting such degradation so that the document set can be queried effectively is the focus of this work. The described approach uses a fusion of substring generation rules and context-aware analysis to correct these errors. Evaluation was facilitated by two publicly available datasets from TREC-5's Confusion Track, with estimated error rates of 5% and 20%. On the 5% datase…
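As a rough illustration of the fusion the abstract describes, the sketch below combines correction candidates from context-free substring substitution rules with candidates from a context-aware bigram model. All rules, names, and data here are illustrative assumptions, not the authors' actual implementation.

```python
# Hypothetical sketch: fuse context-free substring-rule candidates with
# context-dependent (bigram) candidates to correct an OCR-degraded term.
from collections import Counter

# Example rules modeling common OCR confusions, e.g. 'm' misread as 'rn'
# or 'o' misread as '0'. Illustrative only.
SUBSTRING_RULES = {"rn": ["m"], "1": ["l", "i"], "0": ["o"], "vv": ["w"]}

def rule_candidates(term, lexicon):
    """Generate correction candidates by applying substring rules."""
    candidates = set()
    for wrong, rights in SUBSTRING_RULES.items():
        if wrong in term:
            for right in rights:
                fixed = term.replace(wrong, right)
                if fixed in lexicon:
                    candidates.add(fixed)
    return candidates

def context_candidates(prev_word, bigram_counts, lexicon):
    """Rank lexicon words by how often they follow the previous word."""
    scored = [(bigram_counts.get((prev_word, w), 0), w) for w in lexicon]
    return [w for count, w in sorted(scored, reverse=True) if count > 0]

def correct(term, prev_word, lexicon, bigram_counts, k=2):
    """Fuse k context-free and k context-dependent candidates."""
    if term in lexicon:
        return term
    fused = list(rule_candidates(term, lexicon))[:k]
    fused += context_candidates(prev_word, bigram_counts, lexicon)[:k]
    return fused[0] if fused else term

# Toy usage: 'rnodel' (OCR error for 'model') corrected via rules + context.
lexicon = {"model", "made", "the"}
bigrams = Counter({("the", "model"): 5})
print(correct("rnodel", "the", lexicon, bigrams))  # -> 'model'
```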

Cited by 2 publications (4 citation statements) | References 18 publications
“…This suggests that 2 may be the best candidate vector size for most applications. This is supported by previous research ([24], [19]), which shows that a balance of context-free and context-dependent candidates performs best. By using a fusion method with a candidate vector size of 2, we select 2 candidates based on context (selected using bigrams) and 2 candidates not based on context (selected using Segments' substring rules).…”
Section: Methods (supporting)
confidence: 85%
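The candidate vector this excerpt describes can be pictured as below: up to k candidates from the bigram context and up to k from the substring rules, merged into one fused vector. The interleaving strategy and all names are assumptions for illustration; the citing work does not specify this exact merging logic.

```python
# Minimal sketch of assembling a fused candidate vector of size k per source:
# k candidates from bigram context plus k from substring rules (assumed
# already ranked). Names are hypothetical, not from the cited implementation.

def fuse_candidates(context_ranked, rule_ranked, k=2):
    """Interleave the top-k candidates from each source, deduplicated."""
    fused, seen = [], set()
    # Pad with None so both top-k lists are fully consumed by zip.
    for pair in zip(context_ranked[:k] + [None] * k,
                    rule_ranked[:k] + [None] * k):
        for cand in pair:
            if cand is not None and cand not in seen:
                seen.add(cand)
                fused.append(cand)
    return fused

# With k=2 the vector holds up to 2 context-based and 2 rule-based guesses.
print(fuse_candidates(["model", "made"], ["model", "medal"], k=2))
# -> ['model', 'made', 'medal']
```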
“…For example, we don't treat symbols as whitespace when tokenizing, because OCR systems are more likely to replace characters with punctuation. A detailed explanation of the filtering process is described in [19]. We then iterate over all the filtered terms, tasking each evaluated approach to generate substitution candidates for the unrecognized terms.…”
Section: Methods (mentioning)
confidence: 99%
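A minimal sketch of the tokenization choice described in this excerpt: split on true whitespace only, so symbols stay attached to their tokens, since an OCR engine may have substituted a letter with punctuation (e.g. 'qu!ck' for 'quick'). The regex and lexicon check are assumptions, not the filtering process detailed in [19].

```python
import re

def tokenize(text):
    """Split on whitespace only; punctuation stays inside its token."""
    return re.split(r"\s+", text.strip())

def is_unrecognized(token, lexicon):
    """Flag tokens absent from the lexicon as correction targets."""
    return token.lower() not in lexicon

# Toy usage: degraded tokens survive intact for candidate generation.
lexicon = {"the", "quick", "brown", "fox"}
tokens = tokenize("the qu!ck brown f0x")
print([t for t in tokens if is_unrecognized(t, lexicon)])
# -> ['qu!ck', 'f0x']
```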