Querying out-of-vocabulary words in lexicon-based keyword spotting

Puigcerver, Joan; Toselli, Alejandro H.; Vidal, Enrique

doi:10.1007/s00521-016-2197-8

Cited by 12 publications

(5 citation statements)

References 29 publications

(80 reference statements)


“…A word similarity can be computed in terms of character edit distances, possibly weighted by estimated optical dissimilarity between character pairs. Work exploring this idea, along with the use of the Filler-HMM model as a back-off method, is presented in [43].…”

Section: Additional Results and Comparisons (mentioning)

confidence: 99%

HMM word graph based keyword spotting in handwritten document images

Toselli, Vidal, Romero, et al. 2016. Information Sciences

Self Cite


Line-level keyword spotting (KWS) is presented on the basis of frame-level word posterior probabilities. These posteriors are obtained using word graphs derived from the recognition process of a full-fledged handwritten text recognizer based on hidden Markov models and N-gram language models. This approach has several advantages. First, since it uses a holistic, segmentation-free technology, it does not require any kind of word or character segmentation. Second, the use of language models allows the context of each spotted word to be taken into account, thereby considerably increasing the KWS accuracy. And third, the proposed KWS scores are based on true posterior probabilities, computed taking into account all (or most) possible word segmentations of the input image. These scores are properly bounded and normalized. This mathematically clean formulation lends itself to smooth, threshold-based keyword queries which, in turn, permit comfortable trade-offs between search precision and recall. Experiments are carried out on several historic collections of handwritten text images, as well as with a well-known dataset of modern English handwritten text. According to the empirical results, the proposed approach achieves KWS results comparable to those obtained with the recently-introduced "BLSTM neural networks KWS" approach and clearly outperforms the popular, state-of-the-art "Filler HMM" KWS method. Overall, the results clearly support all the above-claimed advantages of the proposed approach.
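
The citation statement for this entry describes word similarity as a character edit distance, optionally weighted by how visually confusable two characters are. A minimal sketch of that idea follows. The function names, the `OPTICAL_COST` table, and its values are hypothetical illustrations, not taken from the cited papers, where such costs would be estimated from data.

```python
from typing import Callable

# Hypothetical substitution costs for visually confusable character pairs.
# Real values would be estimated from an optical model; these are made up.
OPTICAL_COST = {frozenset("un"): 0.3, frozenset("ce"): 0.4}


def unit_sub_cost(x: str, y: str) -> float:
    """Plain Levenshtein substitution cost: 0 for a match, 1 otherwise."""
    return 0.0 if x == y else 1.0


def optical_sub_cost(x: str, y: str) -> float:
    """Substitution cost that is cheaper for confusable pairs (assumed table)."""
    if x == y:
        return 0.0
    return OPTICAL_COST.get(frozenset((x, y)), 1.0)


def weighted_edit_distance(
    a: str,
    b: str,
    sub_cost: Callable[[str, str], float] = unit_sub_cost,
    indel_cost: float = 1.0,
) -> float:
    """Dynamic-programming edit distance with pluggable substitution costs."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = [j * indel_cost for j in range(len(b) + 1)]
    for i, ca in enumerate(a, 1):
        curr = [i * indel_cost]
        for j, cb in enumerate(b, 1):
            curr.append(
                min(
                    prev[j] + indel_cost,          # delete ca
                    curr[j - 1] + indel_cost,      # insert cb
                    prev[j - 1] + sub_cost(ca, cb),  # substitute or match
                )
            )
        prev = curr
    return prev[-1]
```

With the default unit costs this reduces to the ordinary Levenshtein distance. Passing `optical_sub_cost` makes words that differ only in a confusable pair, such as "hand" and "haud", count as closer than words that differ in an unrelated character.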

“…The idea is to smooth the (implicitly null) relevance probabilities of OOV keywords by relying on the indexed probabilities of "similar" in-vocabulary words. Most of our work in this direction is reviewed or presented in [49]. While reasonably good results are achieved with these methods, they always entail query response time penalties for OOV queries, and these penalties can become prohibitive for large collections of say hundreds of thousands or millions of images.…”

Section: Discussion (mentioning)

confidence: 99%

A Probabilistic Framework for Lexicon-based Keyword Spotting in Handwritten Text Images

Vidal, Toselli, Puigcerver. 2021. Preprint


Query by String Keyword Spotting (KWS) is here considered as a key technology for indexing large collections of handwritten text images to allow fast textual access to the contents of these collections. Under this perspective, a probabilistic framework for lexicon-based KWS in text images is presented. The presentation aims at providing a tutorial view which helps to understand the relations between classical statements of KWS and the relative challenges entailed by these statements. More specifically, the development of the proposed framework makes it self-evident that word recognition or classification implicitly or explicitly underlies any formulation of KWS. Moreover, it clearly suggests that the same statistical models and training methods successfully used for handwriting text recognition can also be advantageously used for KWS, even though KWS does not generally require or rely on any kind of previously produced image transcripts. These ideas are developed into a specific, probabilistically sound approach for segmentation-free, lexicon-based, query-by-string KWS. Experiments carried out using this approach are presented, which support the consistency and general interest of the proposed framework. Several datasets traditionally used for KWS benchmarking are considered, with results significantly better than those previously published for these datasets. In addition, results on two new, larger handwritten text image datasets are reported, showing the great potential of the methods proposed in this paper for indexing and textual search in large collections of handwritten documents.
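
The citation statement for this entry describes smoothing the otherwise null relevance probability of an OOV keyword using the indexed probabilities of similar in-vocabulary words. The sketch below illustrates that idea for a single line image. The scoring rule (best in-vocabulary word, discounted by `exp(-alpha * distance)`), the `alpha` parameter, and the function names are assumptions made for illustration, not the formula used in the cited work.

```python
import math
from typing import Dict


def levenshtein(a: str, b: str) -> int:
    """Unweighted character edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def smoothed_relevance(query: str, indexed: Dict[str, float], alpha: float = 1.0) -> float:
    """Relevance score of `query` for one image, given its indexed word scores.

    `indexed` maps each in-vocabulary word to its relevance probability for
    the image. In-vocabulary queries are looked up directly. OOV queries back
    off to the most similar indexed word, discounted by edit distance
    (an assumed rule, chosen only to keep the score in [0, 1]).
    """
    if query in indexed:
        return indexed[query]
    # This loop visits every indexed word, which is the per-query cost that
    # grows with vocabulary and collection size.
    return max(
        (math.exp(-alpha * levenshtein(query, word)) * prob
         for word, prob in indexed.items()),
        default=0.0,
    )
```

For example, with `indexed = {"house": 0.9, "mouse": 0.2}`, the OOV query "hause" inherits a discounted share of the score of "house". The full scan over `indexed` for every OOV query also shows where the response-time penalty mentioned in the statement comes from.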


“…Moreover, it can recognise and retrieve results for words where there are historical or personal variations in spelling. Thus, this form of searching can produce useable results with HTR models that have higher error rates, up to 30 per cent CER (Giotis et al, 2017; Puigcerver et al, 2015, 2017; Retsinas et al, 2016; Strauß et al, 2016; Toselli et al, 2017). The platform displays the results of a Keyword Spotting query as a list of transcribed words, thumbnail images of the portion of the digitised pages on which those words appear and a confidence rating for each word.…”

Section: Contribute (mentioning)

confidence: 99%

Transforming scholarship in the archives through handwritten text recognition

Muehlberger, Seaward, Terras, et al. 2019

Self Cite


Purpose: An overview of the current use of handwritten text recognition (HTR) on archival manuscript material, as provided by the EU H2020 funded Transkribus platform. It explains HTR, demonstrates Transkribus, gives examples of use cases, highlights the effect HTR may have on scholarship, and evidences this turning point of the advanced use of digitised heritage content. The paper aims to discuss these issues. Design/methodology/approach: This paper adopts a case study approach, using the development and delivery of the one openly available HTR platform for manuscript material. Findings: Transkribus has demonstrated that HTR is now a useable technology that can be employed in conjunction with mass digitisation to generate accurate transcripts of archival material. Use cases are demonstrated, and a cooperative model is suggested as a way to ensure sustainability and scaling of the platform. However, funding and resourcing issues are identified. Research limitations/implications: The paper presents results from projects; further user studies could be undertaken involving interviews, surveys, etc. Practical implications: Only HTR provided via Transkribus is covered; however, this is the only publicly available platform for HTR on individual collections of historical documents at time of writing and it represents the current state-of-the-art in this field. Social implications: The increased access to information contained within historical texts has the potential to be transformational for both institutions and individuals. Originality/value: This is the first published overview of how HTR is used by a wide archival studies community, reporting and showcasing current application of handwriting technology in the cultural heritage sector.
