2020
DOI: 10.1109/access.2020.3024194

VB-PTC: Visual Block Multi-Record Text Extraction Based on Sensor Network Page Type Conversion

Abstract: In addition to their main content, web pages usually contain noise such as navigation elements, sidebars, and advertisements. This noise is unrelated to the main content and interferes with data mining and information retrieval tasks, so the sensor is compromised by wrong data and interference noise. Because web page structures are so diverse, it is a challenge to detect relevant information and noise in order to improve the true reliability o…

Citations: cited by 4 publications (1 citation statement)
References: 42 publications
“…The first 2 phases remove the primary and secondary "noises", while the third phase extracts the main content using a weighted block score mechanism. Gong et al [45] developed a text extraction technique that combines a site-level noise reduction based on hashtree with a page-level noise reduction based on linked clusters. This combination eliminates noise in web articles.…”
Section: Related Work
Citation type: mentioning
confidence: 99%
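
The citation statement above mentions extracting the main content with a weighted block score, but it does not say how that score is computed. The following is only a minimal sketch of the general idea, assuming a simple link-based heuristic: words outside links raise a block's score and words inside links lower it. The names `Block`, `block_score`, `main_content`, and the weights `w_text` and `w_link` are hypothetical and do not come from the cited papers.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A visual block of a page (hypothetical structure for illustration)."""
    text: str       # all visible text in the block
    link_text: str  # the portion of that text that sits inside links


def block_score(block: Block, w_text: float = 1.0, w_link: float = 2.0) -> float:
    """Assumed heuristic: reward plain words, penalize linked words."""
    words = len(block.text.split())
    link_words = len(block.link_text.split())
    plain_words = max(words - link_words, 0)
    return w_text * plain_words - w_link * link_words


def main_content(blocks: list[Block], threshold: float = 0.0) -> list[Block]:
    """Keep the blocks whose score exceeds the threshold."""
    return [b for b in blocks if block_score(b) > threshold]


# A navigation bar is almost entirely link text; an article paragraph is not.
nav = Block("Home About Contact", "Home About Contact")
article = Block("Web pages often wrap the article body in menus and ads", "")
kept = main_content([nav, article])
```

Under these assumed weights the navigation block scores below zero and is dropped, while the article paragraph is kept. A real extractor would need weights and a threshold tuned on labeled pages.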