Proceedings of the 20th ACM International Conference on Information and Knowledge Management 2011
DOI: 10.1145/2063576.2063761

Towards a unified solution

Abstract: Although data record extraction from Web pages has been studied extensively, existing methods still fail on many pages because of the complexity of their format or layout. In this paper, we propose a unified method that tackles this task by addressing several key issues in a uniform manner. A new search structure, the Record Segmentation Tree (RST), is designed, and several efficient search pruning strategies on the RST are proposed to identify the records in a given Web page. Another characteristic of…

Cited by 22 publications
(8 citation statements)
References 39 publications
“…Record Segmentation Tree [19], or RST for short, is intended to extract data records from all of the data regions in a web document. It is based on the hypothesis that similar data records in a contiguous region compose a data region, that data records inside a data region are formatted using similar HTML tags, that a data record consists of a collection of subtrees, and that these subtrees share a parent node.…”
Section: RST: Record Segmentation Tree (mentioning)
confidence: 99%
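
The hypothesis quoted above (similar, contiguous records formatted with similar HTML tags under a shared parent) can be illustrated with a minimal sketch. This is not the RST algorithm from the cited paper: it assumes each record is a single subtree and that "similar" means an identical preorder tag sequence, and the names `Node`, `signature`, and `contiguous_similar_runs` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """A simplified DOM node: a tag name plus child nodes (hypothetical type)."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def signature(node: Node) -> Tuple[str, ...]:
    """Return the preorder tag sequence of the subtree rooted at `node`."""
    tags = [node.tag]
    for child in node.children:
        tags.extend(signature(child))
    return tuple(tags)


def contiguous_similar_runs(parent: Node, min_run: int = 2) -> List[Tuple[int, int]]:
    """Find runs of adjacent children of `parent` with identical signatures.

    Returns (start, end) child index ranges, end exclusive. Exact signature
    equality is an assumption of this sketch, standing in for a real
    similarity measure.
    """
    sigs = [signature(child) for child in parent.children]
    runs = []
    start = 0
    for i in range(1, len(sigs) + 1):
        # Close the current run at the end of the list or when the signature changes.
        if i == len(sigs) or sigs[i] != sigs[start]:
            if i - start >= min_run:
                runs.append((start, i))
            start = i
    return runs


def make_item() -> Node:
    """Build one list item containing a link and a span."""
    return Node("li", [Node("a"), Node("span")])


page = Node("div", [Node("h1"), make_item(), make_item(), make_item(), Node("p")])
candidate_regions = contiguous_similar_runs(page)
```

Here `candidate_regions` holds the index range of the three adjacent `li` subtrees, which share the `div` parent and the same tag structure; the heading and the trailing paragraph are left out.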
“…VIPS [24] supports multilevel nesting because it partitions regions in a hierarchical structure in which it maintains the relationships between the parent and child subregions, thus allowing it to detect nested subregions. TPC [114] and RST [19] support multilevel nesting because they are able to detect nested regions and to infer the relationship between parent and child regions. Note that proposals that use VIPS, such as RIPB [84] and VSDR [97], are not considered to extract nested data regions (zero-level) because the former extracts one data region only and the latter does not maintain the relationships between regions.…”
Section: Input and Output Dimension (mentioning)
confidence: 99%
“…DEPTA [21], an extension of the work reported in [22], first processes the page using a Web browser in order to get the boundary information of each DOM node and then detects nested rectangles, thus building a tag tree where the parent relationship indicates containment in the rendered page. DEPTA utilizes a string edit distance to cluster similar nodes into regions, a technique similar to that used by [23], while replacing the tree edit distance with a token edit distance. MiBAT [14], an automatic extraction framework for Web data records containing user-generated content, relies on domain constraints to acquire anchor point information.…”
Section: Related Work (mentioning)
confidence: 99%
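
The token edit distance mentioned in the statement above can be sketched as a standard Levenshtein distance computed over sequences of tag tokens rather than characters. This is a generic illustration, not DEPTA's implementation; the function names and the length-based normalization are assumptions made for the example.

```python
from typing import Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences (unit costs)."""
    # prev[j] is the distance between the first i-1 tokens of a and the first j tokens of b.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def normalized_similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Map the distance to [0, 1]; dividing by the longer length is an assumption."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


# Two tag-token sequences that differ by a single inserted <img>.
first = ["li", "a", "span"]
second = ["li", "a", "img", "span"]
distance = edit_distance(first, second)
similarity = normalized_similarity(first, second)
```

Nodes whose similarity exceeds a chosen threshold could then be grouped into the same candidate region; the threshold itself would have to be tuned and is not given in the quoted text.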