Using the structure of Web sites for automatic segmentation of tables

Lerman, Kristina; Getoor, Lise; Minton, Steven; Knoblock, Craig A.

doi:10.1145/1007568.1007584

Cited by 99 publications

(82 citation statements)

References 28 publications

“…It is possible to use Machine Learning algorithms to learn automatically the wrappers [22,23,26,27,31]. The automatic wrapper induction [22,31] Each landmark automaton is specialized in extracting an attribute.…”

Section: Machine Learning (mentioning)

confidence: 99%

“…Maintaining wrappers is related to two different issues: on the one hand, to detect when a wrapper is not retrieving correctly the data (wrapper verification). On the other hand, to automatically recover the wrapper generating a new wrapper that takes into account the possible changes in the Web source (wrapper reinduction) [23,26,27]. …”

Section: Machine Learning (mentioning)

confidence: 99%

Automatic Web Data Extraction Based on Genetic Algorithms and Regular Expressions

Barrero

Camacho

R-Moreno

2009

Data Mining and Multi-Agent Integration

Data extraction from the World Wide Web is a well-known, unsolved, and critical problem in the design of complex information systems. These problems are related to the extraction, management and reuse of the huge amount of Web data available. These data usually have high heterogeneity, high volatility and low quality (i.e. format and content mistakes), so it is quite hard to build reliable systems. In this chapter we present an updated state-of-the-art review of the problem of Web Data Extraction, and an Evolutionary Computation approach based on Genetic Algorithms and Regular Expressions to the problem of automatically learning software entities. These entities, also called wrappers, will be able to extract some kind of Web data structures from examples.

“…For automatic extraction, [1,4,6] find patterns or grammars from multiple pages containing similar data records. Requiring an initial set of pages containing similar data records is, however, a limitation.…”

Section: Introduction (mentioning)

confidence: 99%

“…Requiring an initial set of pages containing similar data records is, however, a limitation. [6] proposes a method that tries to explore the detailed information pages behind the current page to segment data records. The need for such detailed pages behind is a drawback because many data records do not have such pages or such pages are hard to find.…”

Section: Introduction (mentioning)

confidence: 99%

NET – A System for Extracting Web Data from Flat and Nested Data Records

Liu

Zhai

2005

Lecture Notes in Computer Science

Abstract. This paper studies automatic extraction of structured data from Web pages. Each of such pages may contain several groups of structured data records. Existing automatic methods still have several limitations. In this paper, we propose a more effective method for the task. Given a page, our method first builds a tag tree based on visual information. It then performs a post-order traversal of the tree and matches subtrees in the process using a tree edit distance method and visual cues. After the process ends, data records are found and data items in them are aligned and extracted. The method can extract data from both flat and nested data records. Experimental evaluation shows that the method performs the extraction task accurately.

“…Motivated by this observation, recently several researchers have studied techniques that exploit similarities and differences among pages generated by the same script in order to automatically infer a Web wrapper, i.e. a program to extract and organize in a structured format data from HTML pages (Arasu and Garcia-Molina, 2003;Crescenzi et al, 2001;Crescenzi and Mecca, 2004;Lerman et al, 2004;Wang and Lochovsky, 2002). Based on these techniques they have developed tools that, given a set of pages sharing the same structure, infer a wrapper, which can be used in order to extract the data from all pages conforming to that structure.…”

Section: Introduction (mentioning)

confidence: 99%

Efficiently Locating Collections of Web Pages to Wrap

Blanco

Crescenzi

Merialdo

2005

Proceedings of the First International Conference on Web Information Systems and Technologies

Abstract: Many large web sites contain highly valuable information. Their pages are dynamically generated by scripts which retrieve data from a back-end database and embed them into HTML templates. Based on this observation, several techniques have been developed to automatically extract data from a set of structurally homogeneous pages. These tools represent a step towards the automatic extraction of data from large web sites, but currently their input sample pages have to be manually collected. To scale the data extraction process, this task should be automated as well. We present techniques to automatically gather structurally similar pages from large web sites. We have developed an algorithm that takes as input one sample page, and crawls the site to find pages similar in structure to the given page. The collected pages can feed an automatic wrapper generator to extract data. Experiments conducted over real-life web sites gave us encouraging results.
