On extracting data from tables that are encoded using HTML

Roldán, J.; Jiménez, Patricia; Corchuelo, Rafael

doi:10.1016/j.knosys.2019.105157

Cited by 17 publications

(39 citation statements)

References 72 publications

“…One of the more general tasks in web content extraction is eliminating trivial content elements. In most cases, it is achieved by identifying the web page template and gathering meaningful content only [9]. Another, more specific case is data extraction from tables or lists presented on the web page [10].…”

Section: Approaches and Problems For Automated Website Content Block Identification

mentioning

confidence: 99%

Multi-Purpose Dataset of Webpages and Its Content Blocks: Design and Structure Validation

Griazev

Ramanauskaitė

2021

Applied Sciences

The need for automated data extraction is continuously growing due to the constant addition of information to the worldwide web. Researchers are developing new data extraction methods to achieve increased performance compared to existing methods. Comparing algorithms to evaluate their performance is vital when developing new solutions. Different algorithms require different datasets to test their performance due to the various data extraction approaches. Currently, most datasets tend to focus on a specific data extraction approach. Thus, they generally lack the data that may be useful for other extraction methods. That leads to difficulties when comparing the performance of algorithms that are vastly different in their approach. We propose a dataset of web page content blocks that includes various data points to counter this. We also validate its design and structure by performing block labeling experiments. Web developers of varying experience levels labeled multiple websites presented to them. Their labeling results were stored in the newly proposed dataset structure. The experiment proved the need for proposed data points and validated dataset structure suitability for multi-purpose dataset design.

“…Information can easily be extracted from any HTML document because each element of the document is identified by its corresponding tag [4,5,6]. Thus, after analyzing the structure of the document to be used, the web scraper is coded and can search for any element by its tag.…”

Section: Web Scraping and Its Use

unclassified

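Web-Scraping Teknikan Oinarritutako Azpiegitura Informatikoak. Xerka Online eta Minerva aplikazioak [Computing Infrastructures Based on the Web-Scraping Technique: The Xerka Online and Minerva Applications]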

Gauna

Barona

Fernandez-Gamiz

2021

EKAIA

Many computer systems that make users' tasks easier are in use, in both professional and personal settings. In some cases, however, the gap between what users need and what these systems offer is large. This article describes two computing infrastructures built with the web-scraping technique that improve the functionality of other, pre-existing infrastructures. On the one hand, the Xerka Online application eases the creation and maintenance of researchers' curricula vitae (CVs) by automating the main task researchers have to carry out: searching for publications and assigning them their updated quality indicators (impact factor and citation count). The Minerva application, on the other hand, manages the quality reports produced at the Faculty of Engineering of Vitoria-Gasteiz. To do so, it automatically downloads the closed grade records from the GAUR web application of the University of the Basque Country (UPV/EHU), computes statistics on the grades awarded, and gathers the reports produced at different levels. The main advantages of both applications are the reduction in the time this work requires and in human error.

“…For example, we expect different words related to camera resolution such as "MP", "resolution" or "megapixels" to have similar embedding vectors. The use of property values provides additional information that is not tied to the name of a property, and makes the proposal applicable to scenarios in which the properties do not have meaningful names, e.g., identifiers that are automatically generated by information extraction approaches [12]. The use of machine learning helps use these features in a smart way, learning what features are more important and how they must be combined, which is of great relevance when it comes to word embeddings, since they can have a high number of components that would make setting manual weights and similarity thresholds very difficult.…”

Section: Shopm Ani a I N

mentioning

confidence: 99%

“…Furthermore, in some contexts the name of the properties may be unknown or only a generic identifier. For example, information extraction techniques may identify a piece of text as an instance, but not be able to infer a label with its property name [12]. In these cases, no features can be computed from the property names, and only these instance features enable matching.…”

Section: Features

mentioning

confidence: 99%

LEAPME: Learning-based Property Matching with Embeddings

Ayala

Hernández

Ruiz

et al.

2020

Preprint

Data integration tasks such as the creation and extension of knowledge graphs involve the fusion of heterogeneous entities from many sources. Matching and fusing such entities also requires matching and combining their properties (attributes). However, previous schema matching approaches mostly focus on two sources only and often rely on simple similarity measurements. They thus face problems in challenging use cases such as the integration of heterogeneous product entities from many sources. We therefore present a new machine learning-based property matching approach called LEAPME (LEArning-based Property Matching with Embeddings) that utilizes numerous features of both property names and instance values. The approach makes heavy use of word embeddings to better utilize the domain-specific semantics of both property names and instance values. The use of supervised machine learning helps exploit the predictive power of word embeddings. Our comparative evaluation against five baselines for several multi-source datasets with real-world data shows the high effectiveness of LEAPME. We also show that our approach is effective even when training data from another domain (transfer learning) is used.
