Accessing the structured content of PDF document is a difficult task, requiring pre-processing and reverse engineering techniques. In this paper, we first present different methods to accomplish this task, which are based either on document image analysis, or on electronic content extraction. Then, XCDF, a canonical format with well-defined properties is proposed as a suitable solution for representing structured electronic documents and as an entry point for further researches and works. The system and methods used for reverse engineering PDF document into this canonical format are also presented. We finally present current applications of this work into various domains, spacing from data mining to multimedia navigation, and consistently benefiting from our canonical format in order to access PDF document content and structures.
This paper presents OCD Dolores, an environment that aims at recovering the logical structures from documents by interactively inferring their models. Dolores is based on OCD, an XML canonical document format used to represent structured electronic content efficiently. The relevance of our restructuring system is assessed through a deep evaluation of Dolores' logical labeling capacities.
We are proposing in this paper an automated system to verify that images are correctly associated to labels. The novelty of the system is in the use of Gaussian Mixture Models (GMMs) as statistical modeling scheme as well as in several improvements introduced specifically for the verification task. Our approach is evaluated using the Caltech 101 database. Starting from an initial baseline system providing an equal error rate of 27.4%, we show that the rate of errors can be reduced down to 13% by introducing several optimizations of the system. The advantage of the approach lies in the fact that basically any object can be generically and blindly modeled with limited supervision. A potential target application could be a post-filtering of images returned by search engines to prune out or reorder less relevant images.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.