Towards a unified solution

Lam, Wai; Gu, Yuan

doi:10.1145/2063576.2063761

Cited by 22 publications

(8 citation statements)

References 39 publications

“…Record Segmentation Tree [19], or RST for short, is intended to extract data records from all of the data regions in a web document. It is based on the hypothesis that similar data records in a contiguous region compose a data region, that data records inside a data region are formatted using similar HTML tags, that a data record consists of a collection of subtrees, and that these subtrees share a parent node.…”

Section: RST: Record Segmentation Tree (mentioning)

confidence: 99%

“…VIPS [24] supports multilevel nesting because it partitions regions in a hierarchical structure in which it maintains the relationships between the parent and child subregions, thus allowing to detect nested subregions. TPC [114] and RST [19] support multilevel nesting because they are able to detect nested regions and to infer the relationship between parent and child regions. Note that proposals that use VIPS, such as RIPB [84] and VSDR [97], are not considered to extract nested data regions (zerolevel) because the former extracts one data region only and the latter does not maintain the relationships between regions.…”

Section: Input and Output Dimension (mentioning)

confidence: 99%

“…That implies that as the complexity of typical web documents increases, information extractors have to analyze more and more irrelevant regions, which has an impact on both efficiency and effectiveness [84], [163], [175]. This has motivated a number of authors to work on region extractors as a means to relieve information extractors from the burden of analyzing many regions of a web document that do not contain any relevant information [19], [23], [24], [53], [84], [97], [100], [114], [125], [141], [163], [169], [179], [180]. The difference between information extractors and region extractors is that the former focus on extracting and structuring data records and their attributes, whereas the latter focus on identifying the HTML fragments that contain this information.…”

Section: Introduction (mentioning)

confidence: 99%

“…The literature records an increasing number of proposals in this area [19], [23], [24], [53], [84], [97], [100], [114], [125], [141], [163], [169], [179], [180]. Unfortunately, none of the surveys regarding information extraction that we have found in the literature take them into account [35], [91], [92], [95], [121], [137], [157].…”

Section: Introduction (mentioning)

confidence: 99%

A Survey on Region Extractors from Web Documents

Sleiman

Corchuelo

2013

IEEE Trans. Knowl. Data Eng.

Extracting information from web documents has become a research area in which new proposals sprout out year after year. This has motivated several researchers to work on surveys that attempt to provide an overall picture of the many existing proposals. Unfortunately, none of these surveys provide a complete picture, because they do not take region extractors into account. These tools are kind of preprocessors, because they help information extractors focus on the regions of a web document that contain relevant information. With the increasing complexity of web documents, region extractors are becoming a must to extract information from many websites. Beyond information extraction, region extractors have also found their way into information retrieval, focused web crawling, topic distillation, adaptive content delivery, mashups, and metasearch engines. In this paper, we survey the existing proposals regarding region extractors and compare them side by side.

“…DEPTA [21], an extension of the work reported in [22], first processes the page using a Web browser in order to get the boundaries information of each DOM node and later detects nested rectangles, thus building a tag tree where the parent relationship indicates a containment in the rendered page. DEPTA utilizes a string edit distance to cluster similar nodes into regions, a similar technique used by [23], while replacing the tree edit distance with a token edit distance. MiBAT [14], an automatic extraction framework of Web data record containing user-generated content, relies on domain constraints to acquire anchor points information.…”

Section: Related Work (mentioning)

confidence: 99%

Mining User-Generated Comments

Subercaze

Gravier

Laforest

2015

2015 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT)

Social-media websites, such as newspapers, blogs, and forums, are the main places of generation and exchange of user-generated comments. These comments are viable sources for opinion mining, descriptive annotations and information extraction. User-generated comments are formatted using a HTML template, they are therefore entwined with the other information in the HTML document. Their unsupervised extraction is thus a taxing issue, even greater when considering the extraction of nested answers by different users. This paper presents a novel technique (CommentsMiner) for unsupervised users comments extraction. Our approach uses both the theoretical framework of frequent subtree mining and data extraction techniques. We demonstrate that the comment mining task can be modelled as a constrained closed induced subtree mining problem followed by a learning-to-rank problem. Our experimental evaluations show that CommentsMiner solves the plain comments and nested comments extraction problems for 84% of a representative and accessible dataset, while outperforming existing baselines techniques.
