2002 IEEE International Conference on Data Mining, 2002. Proceedings.
DOI: 10.1109/icdm.2002.1183910

Recognition of common areas in a Web page using visual information: a possible application in a page classification

Cited by 68 publications (52 citation statements)
References 4 publications
“…The manually labeled ones vary in size from 105 to 515, with the exception of the TAP knowledge base (Guha and McCool 2003), at a size of 9,068, which was a semantically labeled database used as a test-bed for the Semantic Web but is unfortunately no longer available. The Web pages are sampled completely at random in (Chakrabarti et al 2008); in (Kohlschütter and Nejdl 2008) they are taken from the Webspam UK-2007 dataset (crawled by the Laboratory of Web Algorithmics, University of Milan, http://law.dsi.unimi.it/), which comprises over 100 million pages and is focused on labeling hosts as spam/nonspam; in (Kovacevic et al 2002) they first downloaded 16,000 random pages from the directory site www.dmoz.org and randomly chose the sample pages from there. In (Vadrevu et al 2005) they make a distinction between template-driven and non-template-driven Web pages (i.e.…”
Section: The Datasets
Confidence: 99%
“…In (Kovacevic et al 2002), the approach is based on heuristics that take visual information into account. They built their own basic browser engine to accomplish this, but it does not take style sheets into account, and they avoid calculating rendering information for every node in the HTML tree.…”
Section: Related Work
Confidence: 99%
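The excerpt above describes approximating visual layout without computing rendering information for every node. A minimal sketch of that idea follows, assuming a simplified model (the tag set, fixed line height, default page width, and the Node structure are illustrative assumptions, not the authors' engine): bounding boxes are assigned top-down with a crude vertical-stacking rule, and recorded only for layout-relevant tags.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

LAYOUT_TAGS = {"table", "tr", "td", "div", "img"}  # assumed tag subset

@dataclass
class Node:
    tag: str
    children: List["Node"] = field(default_factory=list)
    box: Optional[Tuple[int, int, int, int]] = None  # (x, y, w, h) in pixels

def assign_boxes(node: Node, x: int = 0, y: int = 0, width: int = 800) -> int:
    """Assign rough bounding boxes top-down, stacking children vertically.

    Returns the height consumed. A real engine would use CSS and font
    metrics; this sketch substitutes a fixed line height (an assumption).
    """
    LINE_HEIGHT = 20  # assumed fixed height for leaf content
    if not node.children:
        height = LINE_HEIGHT
    else:
        cy = y
        for child in node.children:
            cy += assign_boxes(child, x, cy, width)
        height = cy - y
    # Record coordinates only for layout-relevant tags, mirroring the
    # idea of not computing rendering information for every node.
    if node.tag in LAYOUT_TAGS:
        node.box = (x, y, width, height)
    return height

# Example: a page containing a two-row table.
page = Node("html", [Node("table", [Node("tr", [Node("td")]),
                                    Node("tr", [Node("td")])])])
assign_boxes(page)
```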
“…This approach can even make it possible to search hidden web pages [3]. Similar to it are the most popular methods: DOM-based segmentation [5], Location-Based Segmentation [10], and Vision-Based Page Segmentation [4]. The paper deals with the ability to differentiate features of the web page as blocks.…”
Section: Related Work
Confidence: 99%
“…Gu et al [16] describe a top-down approach to segment a web page and detect its content structure by dividing and merging blocks. Kovacevic et al [19] use visual information to build an "M-tree", a concept similar to the DOM tree enhanced with screen coordinates. They then use further heuristics to recognize common page areas such as the header, left and right menus, footer, and center of a page.…”
Section: Related Work
Confidence: 99%
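Given the screen coordinates stored in such an M-tree, the area-recognition heuristics described in this excerpt can be pictured as simple positional rules. The sketch below is a hypothetical reconstruction: the 20% edge bands and the function name are illustrative assumptions, not values taken from the paper.

```python
def classify_area(box, page_w, page_h):
    """Label a block as header/footer/left menu/right menu/center.

    box is (x, y, width, height) in screen pixels; page_w and page_h are
    the rendered page dimensions. The 20% bands are assumed thresholds.
    """
    x, y, w, h = box
    if y + h <= 0.2 * page_h:   # block lies entirely in the top band
        return "header"
    if y >= 0.8 * page_h:       # block starts in the bottom band
        return "footer"
    if x + w <= 0.2 * page_w:   # block hugs the left edge
        return "left menu"
    if x >= 0.8 * page_w:       # block hugs the right edge
        return "right menu"
    return "center"

# Example: a narrow block along the left edge of an 800x1000 page.
print(classify_area((0, 150, 120, 600), 800, 1000))  # -> "left menu"
```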