FiVaTech: Page-Level Web Data Extraction from Template Pages

Wang, Jie; Zhang, Jun; Lian, Liu; Han, Deping

doi:10.1109/icdmw.2007.95

Cited by 21 publications

(35 citation statements)

References 12 publications


“…We annotated in each dataset the relevant information and then each string item extracted by our proposal was considered as a true positive (tp), false negative (fn), or false positive (fp). We are interested in measuring precision P = […] We used our collection of datasets to compare our proposal to RoadRunner [5] and to FiVaTech [6], cf. Table 1.…”

Section: Results (mentioning)

confidence: 99%
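
The precision formula in the statement above is cut off in the excerpt. As a minimal sketch, assuming the citing paper uses the standard definitions (precision = tp / (tp + fp), recall = tp / (tp + fn), F1 as their harmonic mean), the counts could be turned into scores as follows. The function name `precision_recall_f1` is hypothetical and does not come from any of the cited papers.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall and F1 from raw counts.

    Assumes the standard definitions; the excerpt itself truncates the formula.
    Returns 0.0 for any score whose denominator would be zero.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Illustrative counts only, not results from any paper.
    p, r, f = precision_recall_f1(tp=8, fp=2, fn=2)
    print(f"P={p:.2f} R={r:.2f} F1={f:.2f}")
```

Each extracted string item counts once, so the three counts are enough to compare extractors such as RoadRunner and FiVaTech on the same annotated dataset.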

“…These rules can be handcrafted, learnt using semi-supervised techniques that require the user to provide some annotated training documents [3,4], or unsupervised techniques that learn extraction rules for all the information they consider as relevant inside some training documents [5,6]. Rule-based information extractors need to be maintained or even rewritten if the web source on which they were trained changes [7].…”

Section: Introduction (mentioning)

confidence: 99%

Towards a Method for Unsupervised Web Information Extraction

Sleiman

Corchuelo

2012

Lecture Notes in Computer Science

Abstract. The literature provides a variety of techniques to build the information extractors on which some data integration systems rely. Information extraction techniques are usually based on extraction rules that require maintenance and adaptation if web sources change. We present our preliminary steps towards an unsupervised information extraction technique that searches web documents for shared patterns and fragments them until finding the relevant information that should be extracted. Experimental results on 1230 real-web documents demonstrate that our system performs fast and achieves promising results.

“…Unlike other DOM tree based techniques [20], [40], it does not require processing the entire DOM tree to identify location of attribute value pairs. Usually they represent text nodes and text nodes are always leaf nodes in the DOM tree.…”

Section: A Advantages (mentioning)

confidence: 99%
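
The observation that text nodes are always leaves of the DOM tree can be illustrated with a short sketch. It uses Python's standard `html.parser` module to collect each non-empty text node together with the path of element tags leading to it. This is not the citing paper's heuristic algorithm nor FiVaTech itself; the class `TextLeafCollector`, the helper `text_leaves`, and the `VOID_TAGS` set are hypothetical names introduced for illustration.

```python
from html.parser import HTMLParser

# Elements that never have children or an end tag (a partial, assumed list).
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class TextLeafCollector(HTMLParser):
    """Collect (path, text) pairs for every non-empty text node."""

    def __init__(self) -> None:
        super().__init__()
        self.path: list[str] = []                # currently open element tags
        self.leaves: list[tuple[str, str]] = []  # (tag path, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.path.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray end tags.
        if tag in self.path:
            while self.path and self.path.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            # A text node has no children, so it is recorded as a leaf.
            self.leaves.append(("/".join(self.path), text))


def text_leaves(html: str) -> list[tuple[str, str]]:
    collector = TextLeafCollector()
    collector.feed(html)
    collector.close()
    return collector.leaves


if __name__ == "__main__":
    page = "<html><body><div><b>Title:</b> <span>FiVaTech</span></div></body></html>"
    for path, text in text_leaves(page):
        print(path, "->", text)
```

Because attribute-value pairs usually sit in such leaves, a technique can inspect the leaves alone instead of processing the entire tree.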

“…It filters those equivalence classes which are large and frequently occurring in most of the pages. FIVATECH [20] uses DOM trees of the web pages to deduce schema. They perform merging of the DOM trees into fixed/variant pattern tree.…”

Section: Related Work (mentioning)

confidence: 99%

“…are semi-supervised approaches which provide a sophisticated GUI to guide the extraction process. Further research in the area of web data extraction led to the era of fully automatic or unsupervised web data extractors such as RoadRunner [7], EXALG [2], DEPTA [40], DELA [37], FIVATECH [20] and TRINITY [34]. RoadRunner [7] starts with a sample page and creates a Union Free Regular Expression representing the wrapper.…”

Section: Related Work (mentioning)

confidence: 99%

Web Data Extraction from Scientific Publishers’ Website Using Heuristic Algorithm

Kumaresan

Kalpana

2017

IJISA

Abstract. The WWW is a huge repository of information, and the amount of information available on the web is growing exponentially day by day. End users rely on search engines such as Google, Yahoo, and Bingo to retrieve information. Search engines use web crawlers or spiders, which crawl through a sequence of web pages in order to locate the relevant pages and provide a set of links ordered by relevance. Those indexed web pages are part of the surface web. Getting data from the deep web requires form submission and is not performed by search engines. Data analytics and data mining applications depend on data from deep web pages, and automatic extraction of data from the deep web is cumbersome due to the diverse structure of web pages. In the proposed work, a heuristic algorithm for automatic navigation and information extraction from a journal's home page has been devised. The algorithm is applied to many publishers' websites, such as Nature, Elsevier, BMJ, and Wiley, and the experimental results show that the heuristic technique provides promising results with respect to precision and recall values.

A Novel Approach to Web Information Extraction

Quintero

Jiménez

Corchuelo

2015

Business Information Systems

Business Intelligence requires the acquisition and aggregation of key pieces of knowledge from multiple sources in order to provide valuable information to customers. The Web is the largest source of information nowadays. Unfortunately, the information it provides is available in semi-structured, human-friendly formats, which makes it difficult for automated business processes to handle. Classical propositional and ILP machine-learning techniques have been applied for this purpose. However, the former lack sufficient expressive power, whereas the latter are more expressive but intractable with large datasets. Propositionalisation was devised as a means to provide propositional techniques with more expressive power, enabling them to exploit structural information in a propositional way that allows them to be efficient. In this paper, we present a proposal to extract information from semi-structured web documents that uses this approach. It leverages a classical propositional machine learning technique and enhances it with the ability to learn from an unbounded context, which helps increase its precision and recall. Our experiments prove that our proposal outperforms other state-of-the-art techniques in the literature.
