2020
DOI: 10.1016/j.knosys.2019.105157

On extracting data from tables that are encoded using HTML

Abstract: Tables are a common means to display data in human-friendly formats. Many authors have worked on proposals to extract those data back since this has many interesting applications. In this article, we summarise and compare many of the proposals to extract data from tables that are encoded using HTML and have been published between 2000 and 2018. We first present a vocabulary that homogenises the terminology used in this field; next, we use it to summarise the proposals; finally, we compare them side by side. Ou…

Cited by 17 publications (39 citation statements)
References 72 publications
“…One of the more general tasks in web content extraction is eliminating trivial content elements. In most cases, it is achieved by identifying the web page template and gathering meaningful content only [9]. Another, more specific case is data extraction from tables or lists presented on the web page [10].…”
Section: Approaches and Problems For Automated Website Content Block Identification
Citation type: mentioning
confidence: 99%
“…Information can easily be extracted from any HTML document, because each element of the document is identified by its corresponding tag [4,5,6]. Thus, after analysing the structure of the document to be used, a web scraper can be coded, and it can then look up any element by its corresponding tag.…”
Section: Web-scraping-a Eta Haren Erabilera (Web Scraping and Its Use)
Citation type: unclassified
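The tag-based lookup described in the statement above can be sketched in a few lines. This is a minimal illustration using only Python's standard `html.parser` module, not the method of the cited paper or of the citing authors; the class name `TableCellParser`, the helper `extract_rows`, and the `SAMPLE` markup are all hypothetical and exist only for this example. It assumes a flat table without nested tables or spanning cells.

```python
from html.parser import HTMLParser


class TableCellParser(HTMLParser):
    """Collects the text of every <td>/<th> cell, grouped by <tr> row.

    Hypothetical helper for illustration; nested tables and
    rowspan/colspan are deliberately not handled.
    """

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text is kept only while a cell is open.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def extract_rows(html):
    """Return the table cells found in `html` as a list of rows."""
    parser = TableCellParser()
    parser.feed(html)
    parser.close()
    return parser.rows


# Made-up sample markup, used only to exercise the sketch.
SAMPLE = """
<table>
  <tr><th>Model</th><th>Resolution</th></tr>
  <tr><td>A1</td><td>12 MP</td></tr>
</table>
"""

rows = extract_rows(SAMPLE)
```

Real pages often use tables for layout, merge cells, or nest tables, which is where the extraction proposals surveyed in the article go beyond a tag lookup of this kind.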
“…For example, we expect different words related to camera resolution such as "MP", "resolution" or "megapixels" to have similar embedding vectors. The use of property values provides additional information that is not tied to the name of a property, and makes the proposal applicable to scenarios in which the properties do not have meaningful names, e.g., identifiers that are automatically generated by information extraction approaches [12]. The use of machine learning helps use these features in a smart way, learning what features are more important and how they must be combined, which is of great relevance when it comes to word embeddings, since they can have a high number of components that would make setting manual weights and similarity thresholds very difficult.…”
Section: Shopm Ani a I N
Citation type: mentioning
confidence: 99%
“…Furthermore, in some contexts the name of the properties may be unknown or only a generic identifier. For example, information extraction techniques may identify a piece of text as an instance, but not be able to infer a label with its property name [12]. In these cases, no features can be computed from the property names, and only these instance features enable matching.…”
Section: Features
Citation type: mentioning
confidence: 99%