Abstract: In addition to their main content, web pages usually contain noise such as navigation elements, sidebars, and advertisements. This noise is unrelated to the main content and degrades data mining and information retrieval tasks, so that the sensor is corrupted by erroneous data and interference noise. Because web page structures are so diverse, it is challenging to distinguish relevant information from noise in order to improve the true reliability o…
“…The first two phases remove the primary and secondary "noises", while the third phase extracts the main content using a weighted block score mechanism. Gong et al. [45] developed a text extraction technique that combines site-level noise reduction based on a hash tree with page-level noise reduction based on linked clusters. This combination eliminates noise in web articles.…”
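The site-level step mentioned in the citation statement can be illustrated with a rough sketch. This is not the hash-tree method of Gong et al., whose details are not given here; it is a simplified flat-hash stand-in that assumes each page has already been split into text blocks. The function names `block_hash` and `remove_site_level_noise`, the `threshold` parameter, and the sample pages are all hypothetical. The idea is only that blocks repeated across many pages of one site (navigation bars, footers) are likely noise.

```python
import hashlib
from collections import Counter


def block_hash(text: str) -> str:
    """Hash a text block after normalizing whitespace and case."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def remove_site_level_noise(pages, threshold=0.5):
    """Drop blocks that occur on more than `threshold` of a site's pages.

    `pages` is a list of pages, each a list of text blocks (an assumed
    input format; block segmentation is out of scope for this sketch).
    """
    page_count = len(pages)
    # Count each block once per page, so repeats within a page do not inflate it.
    seen_on = Counter()
    for blocks in pages:
        seen_on.update({block_hash(b) for b in blocks})
    return [
        [b for b in blocks if seen_on[block_hash(b)] / page_count <= threshold]
        for blocks in pages
    ]


# Hypothetical sample data: three pages sharing a navigation bar and a footer.
pages = [
    ["Home | About | Contact", "Article one body text.", "Copyright Example Site"],
    ["Home | About | Contact", "Article two body text.", "Copyright Example Site"],
    ["Home | About | Contact", "Article three body text.", "Copyright Example Site"],
]
cleaned = remove_site_level_noise(pages)
```

With this sample, the navigation bar and footer appear on every page and are dropped, while each article body appears on only one page and is kept. A page-level pass, such as the linked-cluster step the statement mentions, would still be needed for noise that is unique to a single page.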