Boilerplate detection using shallow text features

Kohlschütter, Christian; Fankhauser, Péter; Nejdl, Wolfgang

doi:10.1145/1718487.1718542

Cited by 339 publications

(234 citation statements)

References 19 publications

Mentioning: 232

“…Publication date is 89.4% accurate (253 misses) and post author is 85.4% (264 misses). Table 1 summarizes the above results and presents the accuracy of Boilerpipe (77.4%) [8] (Boilerpipe is presented in detail in Section 4). Concerning the extraction of the title using Boilerpipe, the captured values are considered wrong, since the tool extracts the title of the HTML document.…”

Section: Discussion (mentioning)

confidence: 99%

Self-supervised Automated Wrapper Generation for Weblog Data Extraction

et al. 2013

Abstract. Data extraction from the web is notoriously hard. Of the types of resources available on the web, weblogs are becoming increasingly important due to the continued growth of the blogosphere, but remain poorly explored. Past approaches to data extraction from weblogs have often involved manual intervention and suffer from low scalability. This paper proposes a fully automated information extraction methodology based on the use of web feeds and processing of HTML. The approach includes a model for generating a wrapper that exploits web feeds for deriving a set of extraction rules automatically. Instead of performing a pairwise comparison between posts, the model matches the values of the web feeds against their corresponding HTML elements retrieved from multiple weblog posts. It adopts a probabilistic approach for deriving a set of rules and automating the process of wrapper generation. An evaluation of the model is conducted on a dataset of 2,393 posts and the results (92% accuracy) show that the proposed technique enables robust extraction of weblog properties and can be applied across the blogosphere for applications such as improved information retrieval and more robust web preservation initiatives.

“…article) of a web page. The open source Boilerpipe system is state-of-the-art and one of the most prominent tools for analysing the content of a web page [8]. Boilerpipe makes use of the structural features, such as HTML tags or sequences of tags forming subtrees, and employs methods that stem from quantitative linguistics.…”

Section: Discussion and Related Work (mentioning)

confidence: 99%

Self-supervised Automated Wrapper Generation for Weblog Data Extraction

et al. 2013

“…This approach has been less studied because rendering webpages for classification is a computational expensive operation [15].…”

Section: Related Work (mentioning)

confidence: 99%

Site-Level Web Template Extraction Based on DOM Analysis

Alarte, Insa, Silva et al. 2016

Lecture Notes in Computer Science

Abstract. One of the main development resources for website engineers are Web templates. Templates allow them to increase productivity by plugin content into already formatted and prepared pagelets. For the final user templates are also useful, because they provide uniformity and a common look and feel for all webpages. However, from the point of view of crawlers and indexers, templates are an important problem, because templates usually contain irrelevant information such as advertisements, menus, and banners. Processing and storing this information leads to a waste of resources (storage space, bandwidth, etc.). It has been measured that templates represent between 40% and 50% of data on the Web. Therefore, identifying templates is essential for indexing tasks. In this work we propose a novel method for automatic web template extraction that is based on similarity analysis between the DOM trees of a collection of webpages that are detected using an hyperlink analysis. Our implementation and experiments demonstrate the usefulness of the technique.

“…Recently, researchers focused on improving off-the-shelf tools for identifying many languages (Lui and Baldwin, 2012), discriminating between similar languages where standard tools fail (Tiedemann and Ljubešić, 2012), identifying documents written in multiple languages and identifying the languages in such multilingual documents (Lui et al, 2014). Text quality in automatically constructed web corpora is quite an underresearched topic, with the exception of boilerplate removal / content extraction approaches that deal with this problem implicitly (Baroni et al, 2008;Kohlschütter et al, 2010), but quite drastically, by removing all content that does not conform to the criteria set. A recent approach to assessing text quality in web corpora in an unsupervised manner (Schäfer et al, 2013) calculates the weighted mean and standard deviation of n most frequent words in a corpus sample and measures how much a specific document deviates from the estimated means.…”

Section: Related Work (mentioning)

confidence: 99%

{bs,hr,sr}WaC - Web Corpora of Bosnian, Croatian and Serbian

Ljubešić, Klubička 2014

Proceedings of the 9th Web as Corpus Workshop (WaC-9)

In this paper we present the construction process of top-level-domain web corpora of Bosnian, Croatian and Serbian. For constructing the corpora we use the SpiderLing crawler with its associated tools adapted for simultaneous crawling and processing of text written in two scripts, Latin and Cyrillic. In addition to the modified collection process we focus on two sources of noise in the resulting corpora: 1. they contain documents written in the other, closely related languages that can not be identified with standard language identification methods and 2. as most web corpora, they partially contain low-quality data not suitable for the specific research and application objectives. We approach both problems by using language modeling on the crawled data only, omitting the need for manually validated language samples for training. On the task of discriminating between closely related languages we outperform the state-of-the-art Blacklist classifier reducing its error to a fourth.
