Proceedings of the 13th International Conference on World Wide Web 2004
DOI: 10.1145/988672.988740

Automatic web news extraction using tree edit distance

Abstract: The Web poses itself as the largest data repository ever available in the history of humankind. Major efforts have been made to provide efficient access to relevant information within this huge repository of data. Although several techniques have been developed for the problem of Web data extraction, their use is still not widespread, mostly because of the need for high human intervention and the low quality of the extraction results. In this paper, we present a domain-oriented approach to Web data extract…

Cited by 239 publications (141 citation statements)
References 26 publications
“…Alternatively, there are fully automated techniques [7][8][9], which require a single URL as input and aim to return a set of objects extracted from the content. Though these tools are considered to be easier to use, their output generally requires more attention than that of semi-automated solutions.…”
Section: Related Work (mentioning)
confidence: 99%
“…Most of these rely on the DOM tree representation, also called tag tree, of the document [4,5,7-11], though some also look at the way the data is visualized [3,12] using optical recognition software. Such techniques often focus on frequency and pattern analysis of subtrees to estimate where data of interest can be found.…”
Section: Related Work (mentioning)
confidence: 99%
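
The frequency analysis of subtrees mentioned in the statement above can be illustrated with a minimal sketch. It is not the method of the cited paper or of any citing work: it assumes that a "pattern" is simply the nested tag structure of a subtree, ignores text and attributes, and uses only Python's standard `html.parser`. The names `TagTreeBuilder`, `signature`, and `repeated_subtrees` are hypothetical and introduced here for illustration.

```python
from collections import Counter
from html.parser import HTMLParser

# Tags that never have a closing tag (a small, non-exhaustive assumption).
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class TagTreeBuilder(HTMLParser):
    """Builds a bare tag tree: each node is (tag, [children]); text is ignored."""

    def __init__(self):
        super().__init__()
        self.root = ("#root", [])
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = (tag, [])
        self.stack[-1][1].append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break


def signature(node, counts):
    """Return a structural signature for a subtree and count non-leaf ones."""
    tag, children = node
    sig = tag + "(" + ",".join(signature(c, counts) for c in children) + ")"
    if children:  # leaves are too common to be informative
        counts[sig] += 1
    return sig


def repeated_subtrees(html, min_count=2):
    """Map each subtree signature seen at least min_count times to its count."""
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    counts = Counter()
    signature(builder.root, counts)
    return {s: n for s, n in counts.items() if n >= min_count}
```

On a page whose list items share one layout, the repeated signature points at the region likely to hold the records of interest. Real extractors typically need approximate matching (for example, a tree edit distance) rather than the exact signature equality assumed here.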
“…It has also been investigated in the context of Web data extraction, using structural patterns with special VLDC symbols (substituting single nodes, sets of siblings, and/or sub-trees) to identify data-rich information in Web documents [68]. Nonetheless, despite being comparable, VLDC symbols are fairly different from repeatability and alternativeness operators in XML grammars (cf.…”
Section: Approximate Pattern Matching with VLDC (mentioning)
confidence: 99%
“…Clustering can also be critical in information extraction. Current information extraction methods either implicitly or explicitly depend on the structural features of documents [17,68]. Structural clustering makes it possible to automatically identify the sets of XML documents and/or document patterns that are useful in information extraction algorithms, in order to produce meaningful results [68].…”
Section: XML Document Clustering (mentioning)
confidence: 99%