International audienceAs new innovative devices, accepting or producing on-line documents, emerge, managing facilities for these kinds of documents such as topic spotting are required. This means that we should be able to perform text categorization of on-line documents. The textual data available in on-line documents can be extracted through online recognition, a process which produces noise, i.e. errors, in the resulting text. This work reports experiments on categorization of on-line handwritten documents based on their textual contents. We analyze the effect of the word recognition rate on the categorization performances, by comparing the performances of a categorization system over the texts obtained through on-line handwriting recognition and the same texts available as ground truth. Two categorization algorithms (kNN and SVM) are compared in this work. A subset of the Reuters-21578 corpus consisting of more than 2000 handwritten documents has been collected for this study. Results show that accuracy loss is not significant, and precision loss is only significant for recall values of 60%-80% depending on the noise levels
In this work, we propose to combine two quite different approaches for retrieving handwritten documents. Our hypothesis is that different retrieval algorithms should retrieve different sets of documents for the same query. Therefore, significant improvements in retrieval performances can be expected. The first approach is based on information retrieval techniques carried out on the noisy texts obtained through handwriting recognition, while the second approach is recognition-free using a word spotting algorithm. Results shows that for texts having a word error rate (WER) lower than 23%, the performances obtained with the combined system are close to the performances obtained on clean digital texts. In addition, for poorly recognized texts (WER > 52%), an improvement of nearly 17% can be observed with respect to the best available baseline method.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.