2002 IEEE International Conference on Data Mining, 2002. Proceedings.
DOI: 10.1109/icdm.2002.1183910

Recognition of common areas in a Web page using visual information: a possible application in a page classification

Cited by 68 publications (52 citation statements)
References 4 publications
“…The manually labeled ones vary in size from 105 to 515, with the exception of the TAP knowledge base (Guha and McCool 2003), at a size of 9,068, which was a semantically labeled database used as a test-bed for the Semantic Web but is unfortunately no longer available. The Web pages are sampled completely at random in (Chakrabarti et al 2008); in (Kohlschütter and Nejdl 2008) they are taken from the Webspam UK-2007 dataset (crawled by the Laboratory of Web Algorithmics, University of Milan, http://law.dsi.unimi.it/), which comprises over 100 million pages and is focused on labeling hosts as spam/nonspam; in (Kovacevic et al 2002) they first downloaded 16,000 random pages from the directory site www.dmoz.org and randomly chose the sample pages from there. In (Vadrevu et al 2005) they make a distinction between template-driven and non-template-driven Web pages (i.e.…”
Section: The Datasets
Confidence: 99%
“…In (Kovacevic et al 2002), the approach is based on heuristics that take visual information into account. They built their own basic browser engine to accomplish this, but it does not take style sheets into account, and they avoid calculating rendering information for every node in the HTML tree.…”
Section: Related Work
Confidence: 99%
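The excerpt above describes approximating visual layout without computing rendering information for every node. A minimal sketch of that idea follows, assuming a simplified model (the tag set, fixed line height, default page width, and the Node structure are illustrative assumptions, not the authors' engine): bounding boxes are assigned top-down with a crude vertical-stacking rule, and recorded only for layout-relevant tags.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

LAYOUT_TAGS = {"table", "tr", "td", "div", "img"}  # assumed tag subset

@dataclass
class Node:
    tag: str
    children: List["Node"] = field(default_factory=list)
    box: Optional[Tuple[int, int, int, int]] = None  # (x, y, w, h) in pixels

def assign_boxes(node: Node, x: int = 0, y: int = 0, width: int = 800) -> int:
    """Assign rough bounding boxes top-down, stacking children vertically.

    Returns the height consumed. A real engine would use CSS and font
    metrics; this sketch substitutes a fixed line height (an assumption).
    """
    LINE_HEIGHT = 20  # assumed fixed height for leaf content
    if not node.children:
        height = LINE_HEIGHT
    else:
        cy = y
        for child in node.children:
            cy += assign_boxes(child, x, cy, width)
        height = cy - y
    # Record coordinates only for layout-relevant tags, mirroring the
    # idea of not computing rendering information for every node.
    if node.tag in LAYOUT_TAGS:
        node.box = (x, y, width, height)
    return height

# Example: a page containing a two-row table.
page = Node("html", [Node("table", [Node("tr", [Node("td")]),
                                    Node("tr", [Node("td")])])])
assign_boxes(page)
```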
“…This approach can even make it possible to search hidden web pages [3]. Similar to it are the most popular methods: DOM-based segmentation [5], Location-Based Segmentation [10], and Vision-Based Page Segmentation [4]. The paper deals with the ability to differentiate features of the web page as blocks.…”
Section: Related Work
Confidence: 99%
“…Gu et al [16] describe a top-down approach to segment a web page and detect its content structure by dividing and merging blocks. Kovacevic et al [19] use visual information to build an "M-tree", a concept similar to the DOM tree enhanced with screen coordinates. They then use further heuristics to recognize common page areas such as the header, left and right menus, footer, and center of a page.…”
Section: Related Work
Confidence: 99%
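Given the screen coordinates stored in such an M-tree, the area-recognition heuristics described in this excerpt can be pictured as simple positional rules. The sketch below is a hypothetical reconstruction: the 20% edge bands and the function name are illustrative assumptions, not values taken from the paper.

```python
def classify_area(box, page_w, page_h):
    """Label a block as header/footer/left menu/right menu/center.

    box is (x, y, width, height) in screen pixels; page_w and page_h are
    the rendered page dimensions. The 20% bands are assumed thresholds.
    """
    x, y, w, h = box
    if y + h <= 0.2 * page_h:   # block lies entirely in the top band
        return "header"
    if y >= 0.8 * page_h:       # block starts in the bottom band
        return "footer"
    if x + w <= 0.2 * page_w:   # block hugs the left edge
        return "left menu"
    if x >= 0.8 * page_w:       # block hugs the right edge
        return "right menu"
    return "center"

# Example: a narrow block along the left edge of an 800x1000 page.
print(classify_area((0, 150, 120, 600), 800, 1000))  # -> "left menu"
```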