Companion Proceedings of the Web Conference 2020
DOI: 10.1145/3366424.3383547

Boilerplate Removal using a Neural Sequence Labeling Model

Cited by 20 publications (21 citation statements)
References 13 publications
“…We compare SemText with BoilerNet [3], Web2Text [17], and BoilerPipe [13], the best models so far, under the measures of average precision and recall. To provide fair comparison, we train and test these models as we train SemText using the same combined dataset and fine-tune them using the same development data of CleanEval and GoodTrends (except BoilerPipe that is not built for fine-tuning).…”
Section: Comparison Results (mentioning)
confidence: 99%
“…To avoid handcrafted features, deep learning models have recently been used to detect boilerplate. BoilerNet [3], for example, is a neural sequence-labeling model that represents each text block as a vector, encoding both HTML tags and words in the text block. Each index in the vector indicates the token count for a specific tag or word.…”
Section: Related Work (mentioning)
confidence: 99%
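
The count-vector representation described in the statement above can be illustrated with a minimal sketch. It is not BoilerNet's actual code: the function name `block_to_count_vector`, the toy vocabularies, and the choice to lowercase words are all assumptions made for illustration. Each text block becomes one vector in which every index holds the count of a specific tag or word.

```python
from collections import Counter

def block_to_count_vector(tags, words, tag_vocab, word_vocab):
    """Return a count vector: one slot per known tag, then one slot per known word.

    Hypothetical helper for illustration only; tokens outside the
    vocabularies are simply ignored.
    """
    tag_counts = Counter(tags)
    word_counts = Counter(w.lower() for w in words)  # lowercasing is an assumption
    return [tag_counts[t] for t in tag_vocab] + [word_counts[w] for w in word_vocab]

# Toy vocabularies (assumed, not taken from the paper).
tag_vocab = ["div", "p", "a"]
word_vocab = ["home", "news", "the"]

# A text block nested in <div><p> containing the words "The news the".
vec = block_to_count_vector(["div", "p"], ["The", "news", "the"], tag_vocab, word_vocab)
print(vec)  # [1, 1, 0, 0, 1, 2]
```

A sequence-labeling model would then consume one such vector per text block, in page order, and predict a content-or-boilerplate label for each.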
“…There are a number of other tasks that are related to semi-structured information extraction. Boilerplate removal (Leonhardt et al, 2020) attempts to remove the unrelated elements of the page, e.g. advertising or navigation, with a binary classifier.…”
Section: Related Work (mentioning)
confidence: 99%