Tag tree template for Web information and schema extraction

Ji, Xiangwen; Zeng, Jianping; Zhang, Shiyong; Wu, Chengrong

doi:10.1016/j.eswa.2010.05.027

Cited by 13 publications

(15 citation statements)

References 18 publications

“…Main processes include message extraction from webpages, message filtering, and message storage [17].…”

Section: A the Framework Of The Proposed Methods (mentioning)

confidence: 99%

“…An automatic program which can get the pages from the Web is designed. We parse the pages using a tag-based template extraction method [17] and get the information of each post. On the other hand, we select the trading data, including stock code, closing price, and so on, from China A-share stock market between 2009 and 2012.…”

Section: A Dataset and Experiments Methods (mentioning)

confidence: 99%

“…Webpage content extraction techniques are well researched, and we utilize our previous method to perform the extraction [17]. Hence one of the main problems in preprocessing is how to filter noise messages.…”

Section: B Preprocessing (mentioning)

confidence: 99%

Identification of Opinion Leaders Based on User Clustering and Sentiment Analysis

Duan; Zeng; Luo, 2014

2014 IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT)

Self Cite

Opinion leaders play an important role in influencing topics of discussion among a group of persons. Hence, identification of opinion leaders has received recent attention. Specifically, discovering opinion leaders in a Web-based stock message board might be valuable for many investors. Current methods for finding opinion leaders mainly concentrate on a graph of user connections, and thus lead to a large amount of computation. On the other hand, opinions in user messages are usually ignored, so the effectiveness in finding opinion leaders is very limited. In this paper, a new method is proposed to recognize opinion leaders in Web-based stock message boards. We combine a clustering algorithm and sentiment analysis to address the two problems in current methods. Features of user activities are calculated based on messages posted on the board; then a clustering algorithm is applied to the user data to generate clusters which contain potential opinion leaders. Next, we apply sentiment analysis to the candidates and associate the sentiment with the actual price movement trend. By this means, opinion leaders can be well discovered, since a good ability to analyze the stock market is considered a skill of influential users. Comparative experiments on a data set which contains real discussions and stock messages are conducted, and the effectiveness of the proposed method is evaluated.

“…On the contrary, more different rules are encouraged to use when facing different tasks. In addition, two third-party tools can function together: HTML tidy [3] and HTML Parser [7]. The former is a proposal that is intended to preprocess web documents by fixing their HTML code and converting it into XHTML.…”

Section: Discussion (mentioning)

confidence: 99%

An Analysis of Characters and Structures of Web Pages Based on Regular Expressions

2014

Proceedings of the 3rd International Conference on Computer Science and Service System

“…Ji et al [19] proposed a tag tree algorithm, in which they detected and removed the shared part among web pages with the same template, and then the main text is retained. Also some other methods extract the knowledge with Regex rules from the HTML pages.…”

Section: Related Work (mentioning)

confidence: 99%

A Two-Step Resume Information Extraction Algorithm

Chen; Zhang; Niu, 2018

Mathematical Problems in Engineering

With the rapid growth of Internet-based recruiting, there are a great number of personal resumes among recruiting systems. To gain more attention from the recruiters, most resumes are written in diverse formats, including varying font size, font colour, and table cells. However, the diversity of format is harmful to data mining, such as resume information extraction, automatic job matching, and candidates ranking. Supervised methods and rule-based methods have been proposed to extract facts from resumes, but they strongly rely on hierarchical structure information and large amounts of labelled data, which are hard to collect in reality. In this paper, we propose a two-step resume information extraction approach. In the first step, raw text of resume is identified as different resume blocks. To achieve the goal, we design a novel feature, Writing Style, to model sentence syntax information. Besides word index and punctuation index, word lexical attribute and prediction results of classifiers are included in Writing Style. In the second step, multiple classifiers are employed to identify different attributes of fact information in resumes. Experimental results on a real-world dataset show that the algorithm is feasible and effective.
