An Automatic Data Grabber for Large Web Sites

Crescenzi, Valter; Mecca, Giansalvatore; Merialdo, Paolo; Missier, Paolo

doi:10.1016/b978-012088469-8/50137-6

Cited by 99 publications

(146 citation statements)

References 3 publications

Mentioning: 144

“…Good surveys are given by Chang et al (2006) and Laender et al (2002). Research in this field is either template-dependent (Zhai and Liu 2005; Zhao et al 2005; Crescenzi et al 2001; Liu et al 2003; Simon and Lausen 2005; Shi et al 2005) or template-independent (Zhu et al 2005, 2006, 2007; Wang et al 2009). …”

Section: Related Work (mentioning)

confidence: 95%

Extracting multiple news attributes based on visual features

2011

The problem of automatically extracting multiple news attributes from news pages is studied in this paper. Most previous work on web news article extraction focuses only on content. To meet a growing demand for web data integration applications, more useful news attributes, such as title, publication date, author, etc., need to be extracted from news pages and stored in a structured way for further processing. An automatic unified approach to extract such attributes based on their visual features, including independent and dependent visual features, is proposed. Unlike conventional methods, such as extracting attributes separately or generating template-dependent wrappers, the basic idea of this approach is twofold. First, candidates for each news attribute are extracted from the page based on their independent visual features. Second, the true value of each attribute is identified from the candidates based on dependent visual features such as the layout relationships among the attributes. Extensive experiments with a large number of news pages show that the proposed approach is highly effective and efficient.

“…However, the quality of the extracted data was unlikely suitable for subsequent data mining tasks. ROADRUNNER [10] attempts to solve the problem by eliminating the need for training example preparation. The idea is based on the difference and the similarity of the text content of the Web pages.…”

Section: Related Work (mentioning)

confidence: 99%

Learning to adapt cross language information extraction wrapper

Wong

2011

Appl Intell

We propose a framework for adapting a previously learned wrapper from a source Web site to unseen sites in different languages. To achieve this, we exploit the previously learned information extraction knowledge and the previously extracted or collected items in the source Web site. These knowledge and data are automatically translated to the same language as the unseen sites via online Web resources such as online Web dictionaries or maps. Site independent features which capture the characteristics of the content of the data are then derived from the translated information. Several text mining methods are employed to automatically discover a set of machine labeled training examples in the unseen site. Both content oriented features and site dependent features of the machine labeled training examples are used for learning the new wrapper for the new unseen site using our language independent wrapper induction component. We conducted experiments on some real-world Web sites in different languages to demonstrate the effectiveness of our framework.

“…Similar to ours, some of these works are targeted for implicitly schematic Web pages. The works in [2,7,22] address the problem of automatic schema learning of template-driven Web pages. In [44], record segmentation and identification techniques combining information from "list" and "detailed" Web pages, which respectively display list of records and detailed information for individual records, were described.…”

Section: Semantic Web (mentioning)

confidence: 99%

Automated Semantic Analysis of Schematic Data

Mukherjee

Ramakrishnan

2008

World Wide Web

Content in numerous Web data sources, designed primarily for human consumption, are not directly amenable to machine processing. Automated semantic analysis of such content facilitates their transformation into machine-processable and richly structured semantically annotated data. This paper describes a learning-based technique for semantic analysis of schematic data which are characterized by being template-generated from backend databases. Starting with a seed set of hand-labeled instances of semantic concepts in a set of Web pages, the technique learns statistical models of these concepts using light-weight content features. These models direct the annotation of diverse Web pages possessing similar content semantics. The principles behind the technique find application in information retrieval and extraction problems. Focused Web browsing activities require only selective fragments of particular Web pages but are often performed using bookmarks which fetch the contents of the entire page. This results in information overload for users of constrained interaction modality devices such as small-screen handheld devices. Fine-grained information extraction from Web pages, which are typically performed using page specific and syntactic expressions known as wrappers, suffer from lack of scalability and robustness. We report on the application of our technique in developing semantic bookmarks for retrieving targeted browsing content and semantic wrappers for robust and scalable information extraction from Web pages sharing a semantic domain. This work has been conducted while the author was at Stony Brook University. World Wide Web (2008) 11:427-464.
