2007
DOI: 10.1007/s11280-007-0021-1
Information Extraction from Web Pages Using Presentation Regularities and Domain Knowledge

Abstract: The World Wide Web is transforming itself into the largest information resource, making the process of information extraction (IE) from the Web an important and challenging problem. In this paper, we present an automated IE system that is domain independent and that can automatically transform a given Web page into a semi-structured hierarchical document using presentation regularities. The resulting documents are weakly annotated in the sense that they might contain many incorrect annotations and missing labels. We al…

Cited by 20 publications (18 citation statements). References 20 publications.
“…The recursive X-Y cut algorithm was initially elaborated in the framework of a system for technical journal analysis [8]. An automated information extraction system is presented in [15], which takes advantage of presentation regularities in Web pages to organize their content into a hierarchical XML-like structure. Like VIPS, its page segmentation algorithm relies on the DOM tree representation of the HTML page.…”
Section: Related Work
confidence: 99%
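The citation statement above describes segmenting a page by walking its DOM tree and cutting it into blocks wherever presentation is regular. A minimal sketch of that idea is shown below; it is an illustration only, not the algorithm of [15] or VIPS, and the `TreeBuilder` and `segment` helpers, along with the simple "run of identically tagged siblings" heuristic, are assumptions made for the example:

```python
from html.parser import HTMLParser

class Node:
    """A bare-bones DOM node: tag, parent link, children, accumulated text."""
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""

class TreeBuilder(HTMLParser):
    """Builds a simple DOM-like tree from an HTML string using the stdlib parser."""
    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.cur = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.cur)
        self.cur.children.append(node)
        self.cur = node

    def handle_endtag(self, tag):
        if self.cur.parent is not None:
            self.cur = self.cur.parent

    def handle_data(self, data):
        self.cur.text += data.strip()

def segment(node):
    """Recursively cut the DOM tree into blocks.

    Heuristic (an assumption for this sketch): a node whose children all
    share the same tag (e.g. a run of <li> or <tr>) is presentationally
    regular and is emitted as one record block; otherwise recurse.
    """
    tags = [c.tag for c in node.children]
    if len(tags) >= 2 and len(set(tags)) == 1:
        return [[c.text for c in node.children]]  # regular block: one record list
    blocks = []
    for c in node.children:
        blocks.extend(segment(c))
    return blocks

html = "<html><body><h1>Phones</h1><ul><li>Nokia</li><li>Sony</li></ul></body></html>"
tb = TreeBuilder()
tb.feed(html)
print(segment(tb.root))  # the <li> run is detected as one regular block
```

Running this on the sample page yields `[['Nokia', 'Sony']]`: the list items are grouped into a single block, while the irregular `<h1>`/`<ul>` siblings are not merged.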
“…Embley et al. use heuristic rules (Embley, Jiang, & Ng, 1999), which are also used in our research, to discover record boundaries in Web documents. Presentation regularities and domain knowledge are used to extract Web information in the research of Srinivas (Vadrevu, Gelgi, & Davulcu, 2007). Takama (Takama & Mitsuhashi, 2005) analyses layout to calculate the visual similarity of Web pages for retrieval.…”
Section: Related Work
confidence: 99%
“…Different kinds of semantics are lexical semantics, statistical semantics, structural semantics, and prototype semantics. Srinivas Vadrevu et al. (2007) have focused on information extraction from Web pages using presentation regularities and domain knowledge. They argued that a Web page must be divided into information blocks or segments before its content can be organized into hierarchical groups, and that during this partitioning some attribute labels for values may be missing.…”
Section: Semantic-based
confidence: 99%
“…When Internet users want to get information about Nokia products, for example, they first visit search engines such as Yahoo and Google, and then visit all the Web sites suggested by the search engine. Many researchers, such as Guntis Arnicans and Girts Karnitis (2006), Sung Won Jung et al. (2001), Srinivas Vadrevu et al. (2007), and Horacio Saggion et al. (2008), work on extracting information from Web data sources in different domains (travel, products, business intelligence), but these studies deal with a limited set of Web data sources, and users still need search engines such as Yahoo and Google to collect more information. We propose a framework for extracting information from different Web data sources.…”
Section: Introduction
confidence: 99%