Jean-Luc Bloechle scite author profile

Rigamonti

Hadjar

et al. 2006

Accessing the structured content of PDF documents is a difficult task that requires pre-processing and reverse engineering techniques. In this paper, we first present different methods to accomplish this task, based either on document image analysis or on electronic content extraction. Then XCDF, a canonical format with well-defined properties, is proposed as a suitable solution for representing structured electronic documents and as an entry point for further research. The system and methods used for reverse engineering PDF documents into this canonical format are also presented. Finally, we present current applications of this work in various domains, ranging from data mining to multimedia navigation, all of which rely on our canonical format to access PDF document content and structure.

Towards a canonical and structured representation of PDF documents through reverse engineering

Rigamonti

Hadjar

et al. 2005

Reverse-Engineering of PDF Files

Ingold

Bloechle

Rigamonti

2014

OCD Dolores - Recovering Logical Structures for Dummies

Rigamonti

Ingold

2012

Labeled images verification using Gaussian mixture models

Baechler

Hennebert

2009

In this paper, we propose an automated system that verifies whether images are correctly associated with their labels. The novelty of the system lies in the use of Gaussian Mixture Models (GMMs) as the statistical modeling scheme, as well as in several improvements introduced specifically for the verification task. Our approach is evaluated on the Caltech 101 database. Starting from a baseline system with an equal error rate of 27.4%, we show that the error rate can be reduced to 13% by introducing several optimizations. The advantage of the approach is that essentially any object can be modeled generically and blindly with limited supervision. A potential target application is post-filtering of images returned by search engines, to prune or reorder less relevant images.