Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, 2004
DOI: 10.1145/1007568.1007584

Using the structure of Web sites for automatic segmentation of tables

Abstract: Many Web sites, especially those that dynamically generate HTML pages to display the results of a user's query, present information in the form of lists or tables. Current tools that allow applications to programmatically extract this information rely heavily on user input, often in the form of labeled extracted records. The sheer size and rate of growth of the Web make any solution that relies primarily on user input infeasible in the long term. Fortunately, many Web sites contain much explicit and implicit…

Cited by 99 publications
(82 citation statements)
References 28 publications
“…It is possible to use Machine Learning algorithms to learn automatically the wrappers [22,23,26,27,31]. The automatic wrapper induction [22,31] Each landmark automaton is specialized in extracting an attribute.…”
Section: Machine Learning (mentioning)
confidence: 99%
“…Maintaining wrappers is related to two different issues: on the one hand, to detect when a wrapper is not retrieving correctly the data (wrapper verification). On the other hand, to automatically recover the wrapper generating a new wrapper that takes into account the possible changes in the Web source (wrapper reinduction) [23,26,27]. …”
Section: Machine Learning (mentioning)
confidence: 99%
“…For automatic extraction, [1,4,6] find patterns or grammars from multiple pages containing similar data records. Requiring an initial set of pages containing similar data records is, however, a limitation.…”
Section: Introduction (mentioning)
confidence: 99%
“…Requiring an initial set of pages containing similar data records is, however, a limitation. [6] proposes a method that tries to explore the detailed information pages behind the current page to segment data records. The need for such detailed pages behind is a drawback because many data records do not have such pages or such pages are hard to find.…”
Section: Introduction (mentioning)
confidence: 99%
“…Motivated by this observation, recently several researchers have studied techniques that exploit similarities and differences among pages generated by the same script in order to automatically infer a Web wrapper, i.e. a program to extract and organize in a structured format data from HTML pages (Arasu and Garcia-Molina, 2003;Crescenzi et al, 2001;Crescenzi and Mecca, 2004;Lerman et al, 2004;Wang and Lochovsky, 2002). Based on these techniques they have developed tools that, given a set of pages sharing the same structure, infer a wrapper, which can be used in order to extract the data from all pages conforming to that structure.…”
Section: Introduction (mentioning)
confidence: 99%