2006
DOI: 10.1007/11669487_13

XCDF: A Canonical and Structured Document Format

Abstract: Accessing the structured content of PDF documents is a difficult task that requires pre-processing and reverse engineering techniques. In this paper, we first present different methods to accomplish this task, based either on document image analysis or on electronic content extraction. Then XCDF, a canonical format with well-defined properties, is proposed as a suitable solution for representing structured electronic documents and as an entry point for further research and work. The system and method…

Citations: cited by 19 publications (12 citation statements)
References: 16 publications
“…With regard to conversion of legacy PDF documents, DIVA research group proposed a reverse engineering tool XED [1] to analyze the embedded resources of PDF files and generate their physical structures in a format XCDF [2]. Based on the application of XCDF format, another interactive system Dolores [3] was presented to recover logical structure of newspaper through neural network learning mechanism.…”
Section: Introduction (mentioning)
confidence: 99%
“…The basic idea of this segmentation algorithm is inspired from the works [14]. That is, the purpose is to first merge the closest CCs horizontally, in order to recover lines, and then groups these lines together to get blocks of text.…”
Section: B. Line and Block Segmentation (mentioning)
confidence: 99%
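
The two-step idea in the statement above (merge the closest connected components horizontally into lines, then group lines into blocks) can be illustrated with a minimal Python sketch. This is not the cited authors' algorithm: the bounding-box representation, the function names, and the `max_gap` and `max_leading` thresholds are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical representation: a bounding box (x0, y0, x1, y1), y growing downward.
Box = Tuple[float, float, float, float]


def union(a: Box, b: Box) -> Box:
    """Smallest box enclosing both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def v_overlap(a: Box, b: Box) -> float:
    """Vertical overlap of two boxes (negative if they do not overlap)."""
    return min(a[3], b[3]) - max(a[1], b[1])


def merge_into_lines(ccs: List[Box], max_gap: float) -> List[Box]:
    """Step 1: attach each CC to the horizontally closest line on the same row."""
    lines: List[Box] = []
    for cc in sorted(ccs, key=lambda b: b[0]):  # sweep left to right
        best = None  # (gap, line index) of the closest candidate line
        for i, line in enumerate(lines):
            gap = cc[0] - line[2]
            if v_overlap(line, cc) > 0 and gap <= max_gap:
                if best is None or gap < best[0]:
                    best = (gap, i)
        if best is None:
            lines.append(cc)  # start a new line
        else:
            lines[best[1]] = union(lines[best[1]], cc)
    return lines


def group_into_blocks(lines: List[Box], max_leading: float) -> List[Box]:
    """Step 2: stack lines that overlap horizontally and are vertically close."""
    blocks: List[Box] = []
    for line in sorted(lines, key=lambda b: b[1]):  # sweep top to bottom
        for i, block in enumerate(blocks):
            h_overlap = min(block[2], line[2]) - max(block[0], line[0])
            if h_overlap > 0 and line[1] - block[3] <= max_leading:
                blocks[i] = union(block, line)
                break
        else:
            blocks.append(line)  # no nearby block: start a new one
    return blocks
```

For example, `group_into_blocks(merge_into_lines(ccs, max_gap=5), max_leading=6)` returns one bounding box per recovered text block. In practice the thresholds would be derived from the font size or the spacing statistics of the page rather than fixed by hand.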
“…OCD Dolores is based upon the OCD [8] file format (whereas previous versions used XCDF [9]), an XML representation able to efficiently store the content, layout and physical structures extracted and recovered from electronic documents (thanks to XED [8]). …”
Section: The Interactive Learning Environment (mentioning)
confidence: 99%