2007
DOI: 10.1007/s11280-007-0021-1
Information Extraction from Web Pages Using Presentation Regularities and Domain Knowledge

Abstract: The World Wide Web is transforming itself into the largest information resource, making the process of information extraction (IE) from the Web an important and challenging problem. In this paper, we present an automated IE system that is domain independent and that can automatically transform a given Web page into a semi-structured hierarchical document using presentation regularities. The resulting documents are weakly annotated in the sense that they might contain many incorrect annotations and missing labels. We al…

Cited by 20 publications (18 citation statements). References 20 publications.
“…The recursive X-Y cut algorithm was initially elaborated in the framework of a system for technical journal analysis [8]. An automated information extraction system is presented in [15], which takes advantage of presentation regularities in Web pages to organize their content into a hierarchical XML-like structure. Like VIPS, its page segmentation algorithm relies on the DOM tree representation of the HTML page.…”
Section: Related Work
confidence: 99%
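The citation statement above describes segmenting a page by walking its DOM tree and cutting it into blocks wherever presentation is regular. A minimal sketch of that idea is shown below; it is an illustration only, not the algorithm of [15] or VIPS, and the `TreeBuilder` and `segment` helpers, along with the simple "run of identically tagged siblings" heuristic, are assumptions made for the example:

```python
from html.parser import HTMLParser

class Node:
    """A bare-bones DOM node: tag, parent link, children, accumulated text."""
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""

class TreeBuilder(HTMLParser):
    """Builds a simple DOM-like tree from an HTML string using the stdlib parser."""
    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.cur = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.cur)
        self.cur.children.append(node)
        self.cur = node

    def handle_endtag(self, tag):
        if self.cur.parent is not None:
            self.cur = self.cur.parent

    def handle_data(self, data):
        self.cur.text += data.strip()

def segment(node):
    """Recursively cut the DOM tree into blocks.

    Heuristic (an assumption for this sketch): a node whose children all
    share the same tag (e.g. a run of <li> or <tr>) is presentationally
    regular and is emitted as one record block; otherwise recurse.
    """
    tags = [c.tag for c in node.children]
    if len(tags) >= 2 and len(set(tags)) == 1:
        return [[c.text for c in node.children]]  # regular block: one record list
    blocks = []
    for c in node.children:
        blocks.extend(segment(c))
    return blocks

html = "<html><body><h1>Phones</h1><ul><li>Nokia</li><li>Sony</li></ul></body></html>"
tb = TreeBuilder()
tb.feed(html)
print(segment(tb.root))  # the <li> run is detected as one regular block
```

Running this on the sample page yields `[['Nokia', 'Sony']]`: the list items are grouped into a single block, while the irregular `<h1>`/`<ul>` siblings are not merged.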
“…Embley et al. use heuristic rules (Embley, Jiang, & Ng, 1999), which are also used in our research, to discover record boundaries in Web documents. Presentation regularities and domain knowledge are used to extract Web information in the research of Srinivas (Vadrevu, Gelgi, & Davulcu, 2007). Takama (Takama & Mitsuhashi, 2005) analyses layout to calculate the visual similarity of Web pages for retrieval.…”
Section: Related Work
confidence: 99%
“…Different kinds of semantics are lexical semantics, statistical semantics, structural semantics, and prototype semantics. Srinivas Vadrevu et al. (2007) have focused on information extraction from Web pages using presentation regularities and domain knowledge. They argued that a Web page must be divided into information blocks or segments before its content can be organized into hierarchical groups, and that during this partitioning some attribute labels for values may be missing.…”
Section: Semantic-based
confidence: 99%
“…When Internet users want to get information about Nokia products, for example, they first visit search engines such as Yahoo and Google, and then visit all the Web sites suggested by the search engine. Many researchers, such as Guntis Arnicans and Girts Karnitis (2006), Sung Won Jung et al. (2001), Srinivas Vadrevu et al. (2007), and Horacio Saggion et al. (2008), work on extracting information from Web data sources in different domains (travel, products, business intelligence), but these studies deal with a limited set of Web data sources, and users still need search engines such as Yahoo and Google to collect more information. We propose a framework for extracting information from different Web data sources.…”
Section: Introduction
confidence: 99%