On Foreign Name Search

Soo, Jason

doi:10.1007/978-3-642-12275-0_42

Cited by 7 publications

(4 citation statements)

References 8 publications

“…For completeness, Segments is a system that takes an input string, and using 6 substring rules, returns a list of possible correction candidates derived from a lexicon, ranked by similarity. A detailed Segments description is found in [21,22,20,23]. Recent research has reaffirmed the potential of segmenting strings by using said segments to perform authorship attribution [19].…”

Section: Segments (mentioning)

confidence: 99%

Revisiting Known-Item Retrieval in Degraded Document Collections

Soo

2016

Self Cite

Optical character recognition software converts an image of text to a text document but typically degrades the document's contents. Correcting such degradation to enable the document set to be queried effectively is the focus of this work. The described approach uses a fusion of substring generation rules and context aware analysis to correct these errors. Evaluation was facilitated by two publicly available datasets from TREC-5's Confusion Track containing estimated error rates of 5% and 20%. On the 5% dataset, we demonstrate a statistically significant improvement over the prior art and Solr's mean reciprocal rank (MRR). On the 20% dataset, we demonstrate a statistically significant improvement over Solr, and have similar performance to the prior art. The described approach achieves an MRR of 0.6627 and 0.4924 on collections with error rates of approximately 5% and 20% respectively.
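The mean reciprocal rank reported in the abstract above can be illustrated with a minimal sketch. It assumes each query has exactly one known item and that a query whose item is never retrieved contributes zero; the function name `mean_reciprocal_rank` and the input layout are hypothetical and are not taken from the paper.

```python
from typing import Optional, Sequence


def mean_reciprocal_rank(ranks: Sequence[Optional[int]]) -> float:
    """Average of 1/rank over all queries.

    ranks[i] is the 1-based rank at which the known item for query i
    was retrieved, or None if it was not retrieved (assumed to score 0).
    """
    if not ranks:
        raise ValueError("at least one query is required")
    total = 0.0
    for rank in ranks:
        if rank is None:
            continue  # item not retrieved: contributes 0 to the sum
        if rank < 1:
            raise ValueError("ranks are 1-based and must be >= 1")
        total += 1.0 / rank
    # Divide by the number of queries, including those that missed.
    return total / len(ranks)


if __name__ == "__main__":
    # Four hypothetical queries: hits at ranks 1, 2 and 4, plus one miss.
    print(mean_reciprocal_rank([1, 2, None, 4]))
```

Under these assumptions, a score near 1 means the known item is usually ranked first, and every miss pulls the average toward 0.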

“…UNLV [25], IMPACT [26]), but these datasets are not applicable to our work, as these datasets do not provide means to accurately evaluate our system; namely, they are lacking query relevance (qrel) judgments. Without those, we would only be measuring the correction accuracy of Segments, which has already been exhaustively studied in prior papers using heterogeneous datasets [27], [17], [18]. Therefore, despite the age of the TREC collection, it remains the only collection that provides ground truth, corrupted text, and 3rd party qrel judgments, in a publicly available package.…”

Section: B Limitations (mentioning)

confidence: 99%

“…Over the past years, we evaluated methods for reliably correcting phase one errors via post-processing using our method called Segments [17], [18], [19]. Segments differs from previous research in that it is an unsupervised approach, which makes minimal assumptions about resource availability, and has no dependence on language within the algorithm.…”

Section: Introduction (mentioning)

confidence: 99%

Searching Corrupted Document Collections

Soo

2016

2016 12th IAPR Workshop on Document Analysis Systems (DAS)

Self Cite

Historical documents are typically digitized using optical character recognition. While effective, the results may not always be accurate and are highly dependent on the input. Consequently, degraded documents are often corrupted. Our focus is finding flexible, reliable methods to correct for such degradation, in the face of limited resources. We extend our substring and context fusion based retrieval system, known as Segments, to consider metadata. By extracting topics from documents, and supplementing and weighting our lexicon with co-occurring terms found in documents with those topics, we achieve a statistically significant improvement over the state-of-the-art in all but one test configuration. Our mean reciprocal rank measured on two free, publicly available, independently judged datasets is 0.7657 and 0.5382.

“…In general, supervised algorithms outperform unsupervised algorithms, particularly in cases in which context is important in correcting a word (Lim, ); however, they cannot be used in the absence of training data. We describe an unsupervised approach that has no dependence on domain, language structure, or sequential windows (Soo, ; Soo & Frieder, ). The proposed solution outperforms prior unsupervised solutions and is comparable with a leading supervised approach.…”

Section: Introduction (mentioning)

confidence: 99%