2021 7th International Conference on Signal Processing and Intelligent Systems (ICSPIS) 2021
DOI: 10.1109/icspis54653.2021.9729342

Web Content Extraction by Weighing the Fundamental Contextual Rules

Cited by 1 publication (2 citation statements)
References 23 publications
“…Besides the classification based on the technique used to build the wrapper (described in Section 2.4), block detection techniques can be further classified depending on the way in which they internally represent the web pages: (i) web pages are treated as HTML code, (ii) web pages are treated as a rendered image, and (iii) web pages are treated as a DOM tree: i. HTML-based approaches are mainly based on densitometry methods ([79]) that use the textual information of the web page. Many of them assume that the main content on a web page contains a high text density and a low tag density.…”
Section: Block Detection (mentioning)
confidence: 99%
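
The density assumption quoted above can be illustrated with a minimal sketch. This is not the method of the cited paper or of [79]; it is a generic text-to-tag ratio computed with Python's standard `html.parser`. The names `text_to_tag_ratio` and `likely_main_content` and the threshold of 20 characters per tag are assumptions chosen only for illustration.

```python
from html.parser import HTMLParser


class _DensityCounter(HTMLParser):
    """Counts opening tags and visible text characters in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.tags = 0
        self.text_chars = 0

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <br/> also arrive here by default.
        self.tags += 1

    def handle_data(self, data):
        # Ignore leading/trailing whitespace so indentation is not counted as text.
        self.text_chars += len(data.strip())


def text_to_tag_ratio(fragment: str) -> float:
    """Hypothetical helper: text characters per opening tag (at least 1 tag assumed)."""
    counter = _DensityCounter()
    counter.feed(fragment)
    counter.close()
    return counter.text_chars / max(counter.tags, 1)


def likely_main_content(fragments, threshold=20.0):
    """Keep fragments whose ratio meets an illustrative, untuned threshold."""
    return [f for f in fragments if text_to_tag_ratio(f) >= threshold]


nav = '<ul><li><a href="/">Home</a></li><li><a href="/about">About</a></li></ul>'
article = "<p>" + "word " * 40 + "</p>"
kept = likely_main_content([nav, article])
```

In this toy input the navigation list has many tags and little text, so it falls below the threshold, while the paragraph is kept. A real extractor would likely need to tune the threshold and choose how pages are split into fragments.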
“…Not even the latest block detection techniques (see, e.g., [115,110,124,66,123,121,53,81]) implement another block detection phase as a preprocess. Many techniques implement simple preprocess methods such as removing nodes that surely do not have any content to extract (see, e.g., [115,90,110]) or standardizing the HTML code and precleaning it (see, e.g., [105,79]).…”
Section: Related Work (mentioning)
confidence: 99%
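
The "simple preprocess methods" mentioned in the statement above (removing nodes that surely hold no extractable content, and precleaning the HTML) might look roughly like the following sketch. It is not taken from any of the cited techniques; the `SKIP_TAGS` set and the `preclean` function are assumptions made for illustration, built on Python's standard `html.parser`.

```python
from html.parser import HTMLParser

# Assumption: elements treated as never containing extractable content.
SKIP_TAGS = {"script", "style", "noscript", "template"}


class _Precleaner(HTMLParser):
    """Re-emits HTML while dropping SKIP_TAGS subtrees and comments."""

    def __init__(self):
        # Keep entities such as &amp; unconverted so they can be re-emitted as written.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags: emit once, without a separate end tag.
        if tag not in SKIP_TAGS and self.skip_depth == 0:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(self.skip_depth - 1, 0)
        elif self.skip_depth == 0:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.out.append(data)

    def handle_entityref(self, name):
        if self.skip_depth == 0:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if self.skip_depth == 0:
            self.out.append(f"&#{name};")

    # Comments are dropped: the inherited handle_comment does nothing.


def preclean(html: str) -> str:
    """Hypothetical helper: return html without skipped elements or comments."""
    parser = _Precleaner()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


page = ("<div><script>var x = 1;</script><style>p{}</style>"
        "<!-- ad slot --><p>Hello &amp; bye</p><br/></div>")
cleaned = preclean(page)
```

Running such a pass before block detection keeps scripts, styles, and comments from being counted as text or tags by later steps. Which elements are safe to drop is a design choice that depends on the pages being processed.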