XCDF: A Canonical and Structured Document Format

Bloechle, Jean-Luc; Rigamonti, Maurizio; Hadjar, Karim; Lalanne, Denis; Ingold, Rolf

doi:10.1007/11669487_13

Cited by 19 publications

(12 citation statements)

References 16 publications

“…With regard to conversion of legacy PDF documents, DIVA research group proposed a reverse engineering tool XED [1] to analyze the embedded resources of PDF files and generate their physical structures in a format XCDF [2]. Based on the application of XCDF format, another interactive system Dolores [3] was presented to recover logical structure of newspaper through neural network learning mechanism.…”

Section: Introduction (mentioning)

confidence: 99%

Logical Labeling of Fixed Layout PDF Documents Using Multiple Contexts

Tao, Tang, et al. 2014

2014 11th IAPR International Workshop on Document Analysis Systems

The task of logical structure recovery is known to be of crucial importance, yet remains unsolved not only for image-based documents but also for born-digital document systems. In this work, the modeling of contextual information based on 2D Conditional Random Fields is proposed to learn page structure for born-digital fixed-layout documents. Heuristic prior knowledge of Portable Document Format (PDF) content and layout is interpreted to construct neighborhood graphs and various pairwise clique templates for the modeling of multiple contexts. By integrating local and contextual observations obtained from PDF attributes, the ambiguities of semantic labels are better resolved. Experimental comparisons for six types of clique templates have demonstrated the benefits of contextual information in logical labeling of 16 finely defined categories.

“…The basic idea of this segmentation algorithm is inspired from the works [14]. That is, the purpose is to first merge the closest CCs horizontally, in order to recover lines, and then groups these lines together to get blocks of text.…”

Section: B Line and Block Segmentation (mentioning)

confidence: 99%

Semi-automatic Annotation Tool for Medieval Manuscripts

Baechler, Bloechle, Ingold 2010

2010 12th International Conference on Frontiers in Handwriting Recognition

Self Cite

Medieval manuscript layouts are quite complex. They contain textual elements such as insertions, annotations, and corrections. They may be richly decorated with ornaments, illustrations, and decorative initials, making their layout even more complex. In this paper we describe a semi-automatic tool which annotates medieval manuscripts using our generic format. This format can represent the physical structure of such manuscripts. Our semi-automatic tool is composed of two parts. The first part performs a layout analysis that automatically segments manuscripts into text blocks and text lines. That is, a Multi-Layer Perceptron (MLP) identifies layout elements by using color features; it extracts the textual content image of the manuscript. Then, a segmentation based on Connected Components (CCs) is performed on the textual content in order to retrieve text blocks and lines. The second part provides an interactive interface allowing the user to customize the automatic analysis, to visualize its results, and to correct them. Our tool is still a prototype; nevertheless, first experiments give encouraging results. Thus, in this paper, we show how to generate a ground truth for medieval manuscript layouts.
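
The two-stage idea described in the citation statement and abstract above (merge the closest connected components horizontally into lines, then group lines into blocks) can be illustrated with a rough sketch. This is not the authors' implementation: the bounding-box representation, the greedy merging strategy, the function names `merge_into_lines` and `group_into_blocks`, and the gap thresholds are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical representation: a connected component (CC), line, or block
# is an axis-aligned box (x0, y0, x1, y1), with y growing downward.
Box = Tuple[int, int, int, int]


def union(a: Box, b: Box) -> Box:
    """Smallest box covering both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def overlaps(a0: int, a1: int, b0: int, b1: int) -> bool:
    """True if the intervals [a0, a1] and [b0, b1] share a non-empty span."""
    return min(a1, b1) > max(a0, b0)


def merge_into_lines(ccs: List[Box], max_gap: int) -> List[Box]:
    """Stage 1: attach each CC to the closest line to its left.

    A CC joins a line when the two overlap vertically and the horizontal
    gap between them is at most max_gap (an assumed threshold).
    """
    lines: List[Box] = []
    for cc in sorted(ccs, key=lambda b: b[0]):  # left to right
        best, best_gap = None, None
        for i, line in enumerate(lines):
            gap = cc[0] - line[2]
            if overlaps(cc[1], cc[3], line[1], line[3]) and gap <= max_gap:
                if best is None or gap < best_gap:
                    best, best_gap = i, gap
        if best is None:
            lines.append(cc)  # start a new line
        else:
            lines[best] = union(lines[best], cc)
    return lines


def group_into_blocks(lines: List[Box], max_gap: int) -> List[Box]:
    """Stage 2: stack lines into blocks.

    A line joins the first block it overlaps horizontally, provided the
    vertical gap to that block is at most max_gap (an assumed threshold).
    """
    blocks: List[Box] = []
    for line in sorted(lines, key=lambda b: b[1]):  # top to bottom
        for i, block in enumerate(blocks):
            gap = line[1] - block[3]
            if overlaps(line[0], line[2], block[0], block[2]) and gap <= max_gap:
                blocks[i] = union(block, line)
                break
        else:
            blocks.append(line)  # start a new block
    return blocks


if __name__ == "__main__":
    ccs = [(0, 0, 10, 10), (12, 0, 22, 10), (0, 14, 10, 24),
           (13, 14, 20, 24), (100, 0, 110, 10)]
    lines = merge_into_lines(ccs, max_gap=5)
    print(lines)
    print(group_into_blocks(lines, max_gap=5))
```

In this sketch the thresholds are fixed constants; a practical tool would more plausibly derive them from the page, for example from the median CC width and height.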

“…OCD Dolores is based upon the OCD [8] file format (whereas previous versions used XCDF [9]), an XML representation able to efficiently store the content, layout and physical structures extracted and recovered from electronic documents (thanks to XED [8]). …”

Section: The Interactive Learning Environment (mentioning)

confidence: 99%