Proceedings 2004 VLDB Conference 2004
DOI: 10.1016/b978-012088469-8/50137-6

An Automatic Data Grabber for Large Web Sites

Abstract: We demonstrate a system to automatically grab data from data-intensive web sites. The system first infers a model that describes, at the intensional level, the web site as a collection of classes; each class represents a set of structurally homogeneous pages and is associated with a small set of representative pages. Based on the model, a library of wrappers, one per class, is then inferred with the help of an external wrapper generator. The model, together with the library of wrappers, can thus be used to navi…
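
The abstract's first step, grouping pages into classes of structurally homogeneous pages, can be illustrated with a minimal sketch. The abstract does not say how the system actually infers its classes, so everything here is an assumption made for illustration: pages are grouped only when their sets of root-to-element tag paths match exactly, and the names `TagPathCollector`, `structural_signature`, and `group_into_classes` are hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class TagPathCollector(HTMLParser):
    """Collects the set of root-to-element tag paths seen in one page."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag, tolerating sloppy HTML.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass


def structural_signature(html):
    """Assumed notion of page structure: the set of tag paths, ignoring text."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return frozenset(collector.paths)


def group_into_classes(pages):
    """Group URLs whose pages share the same signature (hypothetical rule)."""
    classes = defaultdict(list)
    for url, html in pages.items():
        classes[structural_signature(html)].append(url)
    return list(classes.values())


pages = {
    "/p/1": "<html><body><h1>A</h1><p>x</p></body></html>",
    "/p/2": "<html><body><h1>B</h1><p>y</p></body></html>",
    "/list": "<html><body><ul><li>A</li><li>B</li></ul></body></html>",
}
groups = group_into_classes(pages)
```

In this toy input the two `/p/` pages differ only in their text, so they land in one class, while the list page forms its own. A real site would likely need a tolerant similarity measure instead of exact matching, and each resulting class would then be handed to a wrapper generator.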

Cited by 99 publications (146 citation statements)
References 3 publications
“…Good surveys are given by Chang et al (2006) and Laender et al (2002). Research in this field is either template-dependent (Zhai and Liu 2005; Zhao et al 2005; Crescenzi et al 2001; Liu et al 2003; Simon and Lausen 2005; Shi et al 2005) or template-independent (Zhu et al 2005, 2006, 2007; Wang et al 2009). …”
Section: Related Work (mentioning)
confidence: 95%
“…However, the quality of the extracted data was unlikely suitable for subsequent data mining tasks. ROADRUNNER [10] attempts to solve the problem by eliminating the need for training example preparation. The idea is based on the difference and the similarity of the text content of the Web pages.…”
Section: Related Work (mentioning)
confidence: 99%
“…Similar to ours, some of these works are targeted for implicitly schematic Web pages. The works in [2,7,22] address the problem of automatic schema learning of template-driven Web pages. In [44], record segmentation and identification techniques combining information from "list" and "detailed" Web pages, which respectively display list of records and detailed information for individual records, were described.…”
Section: Semantic Web (mentioning)
confidence: 99%