Identifying which parts of a Web page contain target content (e.g., the portion of an online news page that contains the actual article) is a significant problem that must be addressed for many Web-based applications. Most approaches to this problem involve crafting hand-tailored rules or scripts to extract the content, customized separately for particular Web sites. Besides requiring considerable time and effort to implement, hand-built extraction routines are brittle: they fail to properly extract content in some cases and break when the structure of a site's Web pages changes. In this work we treat the problem of identifying content as a sequence labeling problem, a common problem structure in machine learning and natural language processing. Using a Conditional Random Field sequence labeling model, we correctly identify the content portion of Web pages 80-97% of the time, depending on experimental factors such as ensuring the absence of duplicate documents and applying the model to unseen sources.

Categories and Subject Descriptors

General Terms
Algorithms, Experimentation.

Keywords
Conditional random fields, content identification, maximum entropy Markov models, sequence labeling.

INTRODUCTION
Web pages containing news stories also include many other pieces of extraneous information, such as navigation bars, JavaScript, images, and advertisements. A number of tasks necessitate extracting just the news article from these pages: the article might be fed into a database or into an application such as a natural language processing tool, a search engine index, or a duplicate detection tool. Another reason to extract just the news story is to re-display it on a small screen such as a cell phone or PDA. An example of identifying the embedded news article can be seen in Figure 1.

Typically, content extraction is done via a hand-crafted tool targeted at a single Web page format. This approach is brittle in that when the page format changes, the extractor is likely to break. It is also labor intensive, since a new extractor must be written for each unique page format. In our experience, Web page formats change fairly quickly, and custom extractors often become obsolete a short time after they are written. Further, some Web sites use multiple formats concurrently, and identifying and handling each one properly makes this a complex task. As part of a larger project, we initially developed such site-specific content extractors and found them to be unworkable as a long-term solution due to the aforementioned problems. The approach described in this paper is meant to overcome these issues.

The data set for this work consisted of Web pages from 27 different news sites. The sites are visually similar in that each contains a news article surrounded by other information. However, the underlying HTML used to create this layout varies among the sites. For example, while some sites separate sections of the article content with paragraph tags, others segment the...
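The paper does not reproduce an implementation here, but the sequence labeling formulation can be illustrated with a minimal sketch. Below, each Web page is treated as a sequence of HTML blocks, each labeled Content or NotContent, and a linear-chain CRF is trained over simple per-block features. This uses the sklearn-crfsuite library and an illustrative feature set (tag name, word count, link density) as assumptions; it is not the authors' feature set or toolkit.

import sklearn_crfsuite

def block_features(block):
    """Map one HTML block to a feature dict for the CRF (illustrative features)."""
    words = block["text"].split()
    return {
        "tag": block["tag"],                                    # e.g. 'p', 'h1', 'a'
        "num_words": min(len(words), 50),                       # capped word count
        "link_density": round(block["link_words"] / max(len(words), 1), 1),
        "ends_with_period": block["text"].rstrip().endswith("."),
    }

def page_to_features(page):
    """A page is a sequence of blocks; the CRF labels the whole sequence jointly."""
    return [block_features(b) for b in page]

# Toy training data (hypothetical): one page segmented into blocks,
# each hand-labeled as article content or extraneous material.
toy_page = [
    {"tag": "a",  "text": "Home News Sports",                           "link_words": 3},
    {"tag": "h1", "text": "Storm batters coast",                        "link_words": 0},
    {"tag": "p",  "text": "Heavy rain fell overnight across the region.", "link_words": 0},
    {"tag": "p",  "text": "Officials urged residents to stay indoors.",   "link_words": 0},
    {"tag": "a",  "text": "Copyright Contact About",                    "link_words": 3},
]
toy_labels = ["NotContent", "Content", "Content", "Content", "NotContent"]

crf = sklearn_crfsuite.CRF(
    algorithm="lbfgs",              # L-BFGS training with elastic-net regularization
    c1=0.1, c2=0.1,
    max_iterations=100,
    all_possible_transitions=True,  # also learn transitions unseen in training data
)
crf.fit([page_to_features(toy_page)], [toy_labels])

# At prediction time, an unseen page is segmented the same way and each
# block receives a label; the 'Content' spans form the extracted article.
predicted = crf.predict_single(page_to_features(toy_page))
print(predicted)

The key design point the sketch shows is that, unlike a per-block classifier, the CRF scores the label sequence as a whole, so it can learn, for instance, that Content blocks tend to follow other Content blocks, which is what makes the sequence labeling framing attractive for this task.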