Proceedings of the Eleventh International Conference on Information and Knowledge Management 2002
DOI: 10.1145/584792.584828

Structural extraction from visual layout of documents

Abstract: Most information extraction systems focus on the textual content of documents. They treat documents as sequences of words, disregarding the physical and typographical layout of the information. While this strategy helps focus the extraction process on the key semantic content of the document, much valuable information can also be derived from the document's physical appearance. Often, fonts, physical positioning and other graphical characteristics are used to provide additional context to the informati…

Cited by 16 publications (10 citation statements)
References 4 publications
“…al. [19] are based on templates that characterize each part of the document. These templates are either extracted manually or semi-automatically.…”
Section: Related Work (mentioning)
confidence: 99%
“…al. [19] devised a learning algorithm to extract information (author, title, date, etc) that relies on a general procedure for structural extraction. Their proposed technique enables the automatic extraction of entities from the document based on their visual characteristics and relative position in the document layout.…”
Section: Related Work (mentioning)
confidence: 99%
“…The document layout and extracted cross-references (e.g., captions) may suggest how each text segment relates to each image, examples include (Arasu and Garcia-Molina 2003; Crescenzi et al 2001; Rosenfeld et al 2002). Arasu and Garcia-Molina (2003), Crescenzi et al (2001) and Rosenfeld et al (2002) approaches are based on (manually or semi-automatically extracted) templates that characterise each part of the document. Rosenfeld et al (2002) implement a learning algorithm to extract information such as the author, title and date.…”
Section: Semi-automated Knowledge Acquisition (mentioning)
confidence: 99%
“…Rosenfeld et al [6] and Zhai et al [9] suggested a structure extraction method for PDF and Web documents using probabilistic approaches, such as the machine-learning and tree-graph-matching algorithms, respectively. These approaches need to prepare a large amount of annotated data, and the models made from the data are dependent on the data.…”
Section: Introduction (mentioning)
confidence: 99%