Identifying which parts of a Web page contain target content (e.g., the portion of an online news page that contains the actual article) is a significant problem that must be addressed for many Web-based applications. Most approaches to this problem involve crafting hand-tailored rules or scripts to extract the content, customized separately for particular Web sites. Besides requiring considerable time and effort to implement, hand-built extraction routines are brittle: they fail to properly extract content in some cases and break when the structure of a site's Web pages changes. In this work we treat the problem of identifying content as a sequence labeling problem, a common problem structure in machine learning and natural language processing. Using a Conditional Random Field sequence labeling model, we correctly identify the content portion of Web pages 80-97% of the time, depending on experimental factors such as ensuring the absence of duplicate documents and applying the model to unseen sources.

Categories and Subject Descriptors

General Terms
Algorithms, Experimentation.

Keywords
Conditional random fields, content identification, maximum entropy Markov models, sequence labeling.

INTRODUCTION
Web pages containing news stories also include many other pieces of extraneous information, such as navigation bars, JavaScript, images, and advertisements. A number of tasks necessitate extracting just the news article from these pages: the article might be fed into a database or into an application such as a natural language processing tool, a search engine index, or a duplicate detection tool. Another reason to extract just the news story is to re-display it on a small screen such as a cell phone or PDA. An example of identifying the embedded news article can be seen in Figure 1.

Typically, content extraction is done via a hand-crafted tool targeted at a single Web page format. This approach is brittle in that when the page format changes, the extractor is likely to break. It is also labor intensive, since a new extractor must be written for each unique page format. In our experience, Web page formats change fairly quickly, and custom extractors often become obsolete a short time after they are written. Further, some Web sites use multiple formats concurrently, and identifying and handling each one properly makes this a complex task. As part of a larger project, we initially developed such site-specific content extractors and found them to be unworkable as a long-term solution due to the aforementioned problems. The approach described in this paper is meant to overcome these issues.

The data set for this work consisted of Web pages from 27 different news sites. The sites are visually similar in that each contains a news article surrounded by other information. However, the underlying HTML used to create this layout varies among the sites. For example, while some sites separate sections of the article content with paragraph tags, others segment the...
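The paper does not reproduce an implementation here, but the sequence labeling formulation can be illustrated with a minimal sketch. Below, each Web page is treated as a sequence of HTML blocks, each labeled Content or NotContent, and a linear-chain CRF is trained over simple per-block features. This uses the sklearn-crfsuite library and an illustrative feature set (tag name, word count, link density) as assumptions; it is not the authors' feature set or toolkit.

import sklearn_crfsuite

def block_features(block):
    """Map one HTML block to a feature dict for the CRF (illustrative features)."""
    words = block["text"].split()
    return {
        "tag": block["tag"],                                    # e.g. 'p', 'h1', 'a'
        "num_words": min(len(words), 50),                       # capped word count
        "link_density": round(block["link_words"] / max(len(words), 1), 1),
        "ends_with_period": block["text"].rstrip().endswith("."),
    }

def page_to_features(page):
    """A page is a sequence of blocks; the CRF labels the whole sequence jointly."""
    return [block_features(b) for b in page]

# Toy training data (hypothetical): one page segmented into blocks,
# each hand-labeled as article content or extraneous material.
toy_page = [
    {"tag": "a",  "text": "Home News Sports",                           "link_words": 3},
    {"tag": "h1", "text": "Storm batters coast",                        "link_words": 0},
    {"tag": "p",  "text": "Heavy rain fell overnight across the region.", "link_words": 0},
    {"tag": "p",  "text": "Officials urged residents to stay indoors.",   "link_words": 0},
    {"tag": "a",  "text": "Copyright Contact About",                    "link_words": 3},
]
toy_labels = ["NotContent", "Content", "Content", "Content", "NotContent"]

crf = sklearn_crfsuite.CRF(
    algorithm="lbfgs",              # L-BFGS training with elastic-net regularization
    c1=0.1, c2=0.1,
    max_iterations=100,
    all_possible_transitions=True,  # also learn transitions unseen in training data
)
crf.fit([page_to_features(toy_page)], [toy_labels])

# At prediction time, an unseen page is segmented the same way and each
# block receives a label; the 'Content' spans form the extracted article.
predicted = crf.predict_single(page_to_features(toy_page))
print(predicted)

The key design point the sketch shows is that, unlike a per-block classifier, the CRF scores the label sequence as a whole, so it can learn, for instance, that Content blocks tend to follow other Content blocks, which is what makes the sequence labeling framing attractive for this task.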