Blogs are becoming an increasingly popular medium for publishing information.
Besides the main content, most blog pages nowadays also contain noisy
information such as advertisements. Removing these unrelated elements not only
improves the user experience but also helps adapt the content to various
devices such as mobile phones. Although template-based extractors are highly
accurate, they can be expensive to maintain: a large number of templates must
be developed, and an extractor fails once its template is updated. To address
these issues, we present a novel template-independent content extractor for
blog pages. First, we convert a blog page into a DOM tree, in which every
element, including the title and body blocks, corresponds to a subtree. Then we
construct a candidate set of subtrees for the title block and for the body
block, and extract both spatial and content features for the elements contained
in each candidate subtree. SVM classifiers for the title and body blocks are
trained using
these features. Finally, the classifiers are used to extract the main content
from blog pages. We test our extractor on 2,250 blog pages crawled from nine
blog sites with markedly different styles and templates. Experimental results
verify the effectiveness of our extractor.

Comment: 2016 3rd International Conference on Information Science and Control
Engineering (ICISCE
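
The first two steps of the pipeline described in the abstract (building a DOM tree and collecting candidate subtrees with features) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Node`, `build_tree`, `candidates`, and `features` are hypothetical, and the features shown (tree depth as a rough stand-in for a spatial feature, text length and link density as content features) are assumptions chosen for illustration, since the abstract does not list the actual feature set.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so the parser must not descend into them.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """One element of the DOM tree; every node is the root of a subtree."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.text = []  # text fragments directly inside this element
        self.depth = 0 if parent is None else parent.depth + 1


class _TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray closers.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.text.append(data.strip())


def build_tree(html):
    """Parse an HTML string and return the root of a simple DOM tree."""
    builder = _TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def text_len(node):
    """Total number of text characters in the subtree rooted at node."""
    return sum(len(t) for t in node.text) + sum(text_len(c) for c in node.children)


def link_text_len(node, inside_link=False):
    """Number of text characters in the subtree that sit inside <a> elements."""
    inside = inside_link or node.tag == "a"
    own = sum(len(t) for t in node.text) if inside else 0
    return own + sum(link_text_len(c, inside) for c in node.children)


def features(node):
    """Illustrative feature vector for one candidate subtree (assumed features)."""
    n = text_len(node)
    return {
        "tag": node.tag,
        "depth": node.depth,  # crude structural proxy for a spatial feature
        "text_len": n,
        "link_density": link_text_len(node) / n if n else 0.0,
    }


def candidates(root):
    """Every subtree that contains some text is a candidate block."""
    found, stack = [], [root]
    while stack:
        node = stack.pop()
        if node is not root and text_len(node) > 0:
            found.append(node)
        stack.extend(node.children)
    return found
```

In a full system along the lines the abstract describes, the feature dictionaries of labeled candidates would be converted to numeric vectors and used to train one classifier for title blocks and one for body blocks, for example with an SVM implementation such as scikit-learn's `sklearn.svm.SVC`. Spatial features in the sense of rendered position and size would require a layout engine, which this sketch does not attempt.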