Web Content Extraction by Weighing the Fundamental Contextual Rules

Mohammadi, Mehdi; Shayegan, Mohammad Javad; Latifi, Nima

doi:10.1109/icspis54653.2021.9729342

Cited by 1 publication

(2 citation statements)

References 23 publications

“…Besides the classification based on the technique used to build the wrapper (described in Section 2.4), block detection techniques can be further classified depending on the way in which they internally represent the web pages: (i) web pages are treated as HTML code, (ii) web pages are treated as a rendered image, and (iii) web pages are treated as a DOM tree: i. HTML-based approaches are mainly based on densitometry methods ([79]) that use the textual information of the web page. Many of them assume that the main content on a web page contains a high text density and a low tag density.…”

Section: Block Detection (mentioning)

confidence: 99%
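
The density assumption quoted above (main content has high text density and low tag density) can be illustrated with a minimal sketch. This is not the method of the cited paper or of any specific densitometry technique: the per-line granularity, the regex-based tag matching, the function names, and the `min_ratio` threshold are all assumptions chosen for illustration.

```python
import re

# Naive tag matcher; an assumption that is good enough for a sketch,
# not a substitute for a real HTML parser.
TAG_RE = re.compile(r"<[^>]+>")


def line_densities(html):
    """Return (text_chars, tag_count) for each non-empty line of HTML."""
    stats = []
    for line in html.splitlines():
        if not line.strip():
            continue
        tag_count = len(TAG_RE.findall(line))
        text = TAG_RE.sub("", line).strip()
        stats.append((len(text), tag_count))
    return stats


def likely_content_lines(html, min_ratio=20.0):
    """Indices (among non-empty lines) with at least min_ratio text chars per tag.

    min_ratio is a hypothetical threshold, not a value taken from the literature.
    """
    picked = []
    for i, (text_chars, tag_count) in enumerate(line_densities(html)):
        ratio = text_chars / max(tag_count, 1)  # avoid division by zero
        if ratio >= min_ratio:
            picked.append(i)
    return picked


SAMPLE = """<ul><li><a href="/">Home</a></li><li><a href="/about">About</a></li></ul>
<p>Main content blocks usually carry long runs of text with few tags around them.</p>
<div><a href="/x">More</a> <a href="/y">Links</a></div>"""

print(likely_content_lines(SAMPLE))
```

On this sample only the paragraph line passes the threshold, while the navigation list and the link row are rejected because almost every word in them is wrapped in its own tags.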

“…Not even the latest block detection techniques (see, e.g., [115,110,124,66,123,121,53,81]) implement another block detection phase as a preprocess. Many techniques implement simple preprocess methods such as removing nodes that surely do not have any content to extract (see, e.g., [115,90,110]) or standardizing the HTML code and precleaning it (see, e.g., [105,79]).…”

Section: Related Work (mentioning)

confidence: 99%
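
The simple preprocessing mentioned in the statement above (removing nodes that surely hold no content, and standardizing the HTML) might look like the following sketch built on Python's standard `html.parser`. The `SKIP_TAGS` set, the class and function names, and the choice to drop all attributes and comments are assumptions for illustration, not the procedure of any cited technique.

```python
from html.parser import HTMLParser

# Hypothetical list of elements assumed never to hold extractable content.
SKIP_TAGS = {"script", "style", "noscript"}


class Precleaner(HTMLParser):
    """Re-emit HTML with lowercase tags, no attributes, and skipped nodes removed."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0:
            self.out.append(f"<{tag}>")  # attributes dropped on purpose

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(self.skip_depth - 1, 0)
        elif self.skip_depth == 0:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.out.append(data.strip())

    # Comments are dropped: the base class handle_comment does nothing.


def preclean(html):
    parser = Precleaner()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


raw = '<DIV class="a"><script>var x = "<p>";</script><!-- ad --><P>Hello</P><style>p{}</style></DIV>'
print(preclean(raw))  # prints <div><p>Hello</p></div>
```

Dropping attributes keeps the sketch short; a real precleaner would likely keep at least `id` and `class`, since many block detection heuristics rely on them.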
