Automatic web news extraction using tree edit distance

Reis, Davi C.; Golgher, Paulo B.; Silva, A. S.; Laender, Alberto

doi:10.1145/988672.988740

Cited by 239 publications

(141 citation statements)

References 26 publications

Mentioning: 137


“…Alternatively, there are fully automated techniques [7][8][9], which require a single URL as input and aim to return a set of objects extracted from the content. Though these tools are considered to be easier to use, their output generally requires more attention than that of semi-automated solutions.…”

Section: Related Work (mentioning)

confidence: 99%

“…Most of these rely on the DOM tree representation, also called tag tree, of the document [4,5][7][8][9][10][11], though some also look at the way the data is visualized [3,12] using optical recognition software. Such techniques often focus on frequency and pattern analysis of subtrees to estimate where data of interest can be found.…”

Section: Related Work (mentioning)

confidence: 99%

“…In sections 7 and 8, we explain how we can use these XPaths to configure a new wrapper. In section 9 we present a use case to measure the accuracy of our research. In Section 10, we indicate what might still improve our work and finally, in Section 11 we present our final conclusions.…”

Section: Introduction (mentioning)

confidence: 99%


Data Driven XPath Generation

Mol, Bronselaer, Nielandt, et al. 2015

Advances in Intelligent Systems and Computing


Abstract. The XPath query language offers a standard for information extraction from HTML documents. Therefore, the DOM tree representation is typically used, which models the hierarchical structure of the document. One of the key aspects of HTML is the separation of data and the structure that is used to represent it. A consequence thereof is that data extraction algorithms usually fail to identify data if the structure of a document is changed. In this paper, it is investigated how a set of tabular oriented XPath queries can be adapted in such a way it deals with modifications in the DOM tree of an HTML document. The basic idea is hereby that if data has already been extracted in the past, it could be used to reconstruct XPath queries that retrieve the data from a different DOM tree. Experimental results show the accuracy of our method.


“…It has also been investigated in the context of Web data extraction, using structural patterns with special VLDC symbols (substituting single nodes, sets of siblings, and/or sub-trees) to identify data-rich information in Web document [68]. Nonetheless, despite being comparable, VLDC symbols are fairly different from repeatability and alternativeness operators in XML grammars (cf.…”

Section: Approximate Pattern Matching With VLDC (mentioning)

confidence: 99%

“…Clustering can also be critical in information extraction. Current information extraction methods either implicitly or explicitly depend on the structural features of documents [17,68]. Structural clustering allows to automatically identify the sets of XML documents and/or document patterns that are useful in information extraction algorithms, in order to produce meaningful results [68].…”

Section: XML Document Clustering (mentioning)

confidence: 99%

XML document-grammar comparison: related problems and applications

Tekli, Chbeir, Traina, et al. 2011

Open Computer Science


doi:10.2478/s13537-011-0005-1

XML document comparison is becoming an ever more popular research issue due to the increasingly abundant use of XML. Likewise, a growing interest fosters the development of XML grammar matching and comparison, due to the proliferation of heterogeneous XML data sources, particularly on the Web. Nonetheless, the process of comparing XML documents with XML grammars, i.e., XML document and grammar similarity evaluation, has not yet received the attention it deserves. In this paper, we provide an overview on existing research related to XML document/grammar comparison, presenting the background and discussing the various techniques related to the problem. We also discuss some prominent application domains, ranging over document classification and clustering, document transformation, grammar evolution, selective dissemination of XML information, XML querying, as well as alert filtering in intrusion detection systems and Web Services matching and communications.


A flexible approach for extracting metadata from bibliographic citations

Cortez, Silva, Gonçalves, et al. 2009

J. Am. Soc. Inf. Sci.


In this article we present FLUX-CiM, a novel method for extracting components (e.g., author names, article titles, venues, page numbers) from bibliographic citations. Our method does not rely on patterns encoding specific delimiters used in a particular citation style. This feature yields a high degree of automation and flexibility, and allows FLUX-CiM to extract from citations in any given format. Differently from previous methods that are based on models learned from user-driven training, our method relies on a knowledge base automatically constructed from an existing set of sample metadata records from a given field (e.g., computer science, health sciences, social sciences, etc.). These records are usually available on the Web or other public data repositories. To demonstrate the effectiveness and applicability of our proposed method, we present a series of experiments in which we apply it to extract bibliographic data from citations in articles of different fields. Results of these experiments exhibit precision and recall levels above 94% for all fields, and perfect extraction for the large majority of citations tested. In addition, in a comparison against a state-of-the-art information-extraction method, ours produced superior results without the training phase required by that method. Finally, we present a strategy for using bibliographic data resulting from the extraction process with FLUX-CiM to automatically update and expand the knowledge base of a given domain. We show that this strategy can be used to achieve good extraction results even if only a very small initial sample of bibliographic records is available for building the knowledge base.
