Proceedings of the Third ACM International Conference on Web Search and Data Mining 2010
DOI: 10.1145/1718487.1718542

Boilerplate detection using shallow text features

Abstract: In addition to the actual content, Web pages consist of navigational elements, templates, and advertisements. This boilerplate text typically is not related to the main content, may deteriorate search precision, and thus needs to be detected properly. In this paper, we analyze a small set of shallow text features for classifying the individual text elements in a Web page. We compare the approach to complex, state-of-the-art techniques and show that competitive accuracy can be achieved, at almost no cost. Moreover,…
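
To make the idea of classifying text elements by shallow features concrete, here is a minimal Python sketch. It is not the paper's implementation: the abstract does not name the features in its small set, so the two used here (word count and link density), the `TextBlock` type, and both thresholds are assumptions chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class TextBlock:
    """Hypothetical container for one text element of a Web page."""
    text: str          # visible text of the block
    linked_words: int  # number of words that sit inside anchor tags


def shallow_features(block: TextBlock) -> dict:
    """Compute two cheap, purely textual features for a block."""
    num_words = len(block.text.split())
    # Share of the block's words that are link text; 0.0 for empty blocks.
    link_density = block.linked_words / num_words if num_words else 0.0
    return {"num_words": num_words, "link_density": link_density}


def is_boilerplate(block: TextBlock,
                   min_words: int = 10,
                   max_link_density: float = 0.33) -> bool:
    """Flag a block as boilerplate using illustrative thresholds (assumed)."""
    features = shallow_features(block)
    return (features["num_words"] < min_words
            or features["link_density"] > max_link_density)


# A short, fully linked navigation bar versus a longer paragraph.
nav = TextBlock(text="Home About Contact", linked_words=3)
article = TextBlock(text="word " * 40, linked_words=1)
print(is_boilerplate(nav), is_boilerplate(article))
```

Both features can be computed in a single pass over the text without rendering the page, which is the kind of low cost the abstract points to. In practice the decision rule would be learned from labeled data rather than fixed by hand.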

Citations: Cited by 339 publications (234 citation statements)
References: References 19 publications
“…Publication date is 89.4% accurate (253 misses) and post author is 85.4% (264 misses). Table 1 summarizes the above results and presents the accuracy of Boilerpipe (77.4%) [8] (Boilerpipe is presented in detail in Section 4). Concerning the extraction of the title using Boilerpipe, the captured values are considered wrong, since the tool extracts the title of the HTML document.…”
Section: Discussion (mentioning)
confidence: 99%
“…article) of a web page. The open source Boilerpipe system is state-of-the-art and one of the most prominent tools for analysing the content of a web page [8]. Boilerpipe makes use of the structural features, such as HTML tags or sequences of tags forming subtrees, and employs methods that stem from quantitative linguistics.…”
Section: Discussion and Related Work (mentioning)
confidence: 99%
“…This approach has been less studied because rendering webpages for classification is a computational expensive operation [15].…”
Section: Related Work (mentioning)
confidence: 99%
“…Recently, researchers focused on improving off-the-shelf tools for identifying many languages (Lui and Baldwin, 2012), discriminating between similar languages where standard tools fail (Tiedemann and Ljubešić, 2012), identifying documents written in multiple languages and identifying the languages in such multilingual documents (Lui et al, 2014). Text quality in automatically constructed web corpora is quite an underresearched topic, with the exception of boilerplate removal / content extraction approaches that deal with this problem implicitly (Baroni et al, 2008;Kohlschütter et al, 2010), but quite drastically, by removing all content that does not conform to the criteria set. A recent approach to assessing text quality in web corpora in an unsupervised manner (Schäfer et al, 2013) calculates the weighted mean and standard deviation of n most frequent words in a corpus sample and measures how much a specific document deviates from the estimated means.…”
Section: Related Work (mentioning)
confidence: 99%
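
The unsupervised text-quality measure described in the last statement (per-word mean and standard deviation over a corpus sample, then a per-document deviation) can be sketched roughly as follows. This is a simplified illustration, not the method of Schäfer et al: it uses plain rather than weighted means, and the function names, the averaging of per-word deviations, and the toy data are all assumptions.

```python
from collections import Counter
from statistics import mean, pstdev


def relative_freqs(tokens, vocab):
    """Relative frequency of each vocabulary word in one document."""
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero for empty documents
    return [counts[word] / total for word in vocab]


def fit_profile(sample_docs, n=2):
    """Estimate per-word mean and std. dev. for the n most frequent words."""
    all_tokens = [tok for doc in sample_docs for tok in doc]
    vocab = [word for word, _ in Counter(all_tokens).most_common(n)]
    per_doc = [relative_freqs(doc, vocab) for doc in sample_docs]
    means = [mean(column) for column in zip(*per_doc)]
    stds = [pstdev(column) for column in zip(*per_doc)]
    return vocab, means, stds


def deviation_score(doc, profile):
    """Average absolute deviation from the profile, in standard deviations."""
    vocab, means, stds = profile
    freqs = relative_freqs(doc, vocab)
    # Words with zero spread in the sample carry no signal and are skipped.
    zs = [abs(f - m) / s for f, m, s in zip(freqs, means, stds) if s > 0]
    return mean(zs) if zs else 0.0


sample = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "and", "the", "cat"],
    ["a", "cat", "and", "the", "dog"],
]
profile = fit_profile(sample, n=2)
odd = ["buy", "cheap", "pills", "now"]
print(deviation_score(sample[0], profile), deviation_score(odd, profile))
```

A document that contains none of the frequent words scores far from the sample profile, while a document drawn from the sample stays close to it. Unlike boilerplate removal, which discards content outright, a score like this can be thresholded or simply recorded per document.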