Existing works on extracting navigation objects from webpages focus on navigation menus, so as to reveal the information architecture of the site. However, web 2.0 sites such as social networks and e-commerce portals are making the content structure of a web site increasingly difficult to understand. Dynamic and personalized elements in a webpage, such as top stories and recommended lists, are vital to understanding the dynamic nature of web 2.0 sites. To better understand the content structure of web 2.0 sites, in this paper we propose a new extraction method for navigation objects in a webpage. Our method extracts not only static navigation menus, but also dynamic and personalized page-specific navigation lists. Since the navigation objects in a webpage naturally come in blocks, we first cluster hyperlinks into different blocks by exploiting the spatial locations of hyperlinks, the hierarchical structure of the DOM-tree, and the hyperlink density. Then we identify navigation objects from those blocks using an SVM classifier with novel features such as anchor text lengths. Experiments on real-world data sets with webpages from various domains and styles verified the effectiveness of our method.
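As a minimal, hypothetical sketch (not the authors' implementation), the block-clustering step could be approximated by grouping hyperlinks whose DOM paths share a common prefix, then computing per-block features such as link count and average anchor-text length for a downstream classifier. All names and the grouping depth below are illustrative assumptions:

```python
# Hypothetical sketch: cluster hyperlinks into blocks by DOM-path prefix,
# then compute simple per-block features (link count, average anchor length).
# The data layout, `depth` value, and feature set are assumptions, standing
# in for the DOM-tree/spatial/density clustering described in the abstract.
from collections import defaultdict

def cluster_links_by_dom_path(links, depth=3):
    """Group hyperlinks whose DOM paths share a common prefix of `depth` tags."""
    blocks = defaultdict(list)
    for link in links:
        prefix = tuple(link["dom_path"][:depth])
        blocks[prefix].append(link)
    return list(blocks.values())

def block_features(block):
    """Features an SVM-style classifier could use to label a block."""
    n_links = len(block)
    avg_anchor_len = sum(len(l["anchor_text"]) for l in block) / n_links
    return {"n_links": n_links, "avg_anchor_len": avg_anchor_len}

links = [
    {"dom_path": ["html", "body", "nav", "ul", "li"], "anchor_text": "Home"},
    {"dom_path": ["html", "body", "nav", "ul", "li"], "anchor_text": "About"},
    {"dom_path": ["html", "body", "div", "p", "a"], "anchor_text": "Read the full story"},
]
blocks = cluster_links_by_dom_path(links)
for b in blocks:
    print(block_features(b))
```

Navigation blocks tend to contain many links with short anchor texts, which is why a feature like average anchor-text length is plausible input to the classification step.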
Blogs are becoming an increasingly popular medium for information publishing. Besides the main content, most blog pages nowadays also contain noisy information such as advertisements. Removing these unrelated elements not only improves the user experience, but also better adapts the content to various devices such as mobile phones. Though template-based extractors are highly accurate, they can incur high costs, since a large number of templates need to be developed, and they fail once a template is updated. To address these issues, we present a novel template-independent content extractor for blog pages. First, we convert a blog page into a DOM-tree, where all elements, including the title and body blocks of a page, correspond to subtrees. Then we construct subtree candidate sets for the title and the body blocks respectively, and extract both spatial and content features for the elements contained in each subtree. SVM classifiers for the title and the body blocks are trained using these features. Finally, the classifiers are used to extract the main content from blog pages. We test our extractor on 2,250 blog pages crawled from nine blog sites with obviously different styles and templates. Experimental results verify the effectiveness of our extractor.

Comment: 2016 3rd International Conference on Information Science and Control Engineering (ICISCE)
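To make the subtree-classification pipeline concrete, here is a hedged sketch. The feature names and the hand-set linear weights below are assumptions standing in for the trained SVM classifiers described in the abstract; a real system would learn the weights from labeled pages:

```python
# Illustrative sketch: score candidate DOM subtrees as the "body" block using
# spatial/content features. The raw-count representation, feature set, and
# hand-set weights are assumptions, not the paper's actual trained SVM.

def subtree_features(subtree):
    """Content/spatial features for one candidate subtree (dict of raw counts)."""
    link_density = subtree["n_links"] / max(subtree["n_tags"], 1)
    return [subtree["text_len"], link_density, subtree["depth"]]

def body_score(features, weights=(0.01, -2.0, -0.1)):
    """Toy linear score: long text, few links, shallow depth -> likely body."""
    return sum(w * f for w, f in zip(weights, features))

candidates = [
    {"name": "sidebar",   "text_len": 120,  "n_links": 15, "n_tags": 20, "depth": 4},
    {"name": "post_body", "text_len": 2400, "n_links": 2,  "n_tags": 30, "depth": 3},
]
best = max(candidates, key=lambda c: body_score(subtree_features(c)))
print(best["name"])
```

The same scoring idea applies to the title classifier, with different features (e.g., short text near the top of the page) and its own trained weights.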
In the field of public opinion analysis, sentiment analysis is an important basic research branch. Previous studies have successfully shown that advanced pre-trained transformer models can be applied to this scenario in Uyghur and other low-resource languages. However, the majority of these studies are based on traditional language anchor points and rely on the pre-trained model's cross-lingual understanding ability. The Senti-eXLM model proposed in this paper employs a method that adaptively expands the model's knowledge domain and dynamically adjusts the model for Uyghur, in order to improve its understanding and representation of the language and thereby increase the accuracy of text sentiment analysis. Experiments on publicly available data sets demonstrate that, compared to the original model, sentiment classification accuracy improves by 6.17%, training convergence speed increases by 27%, and average inference time increases by only 11%.