Proceedings of the Fifth ACM/IEEE Workshop on Hot Topics in Web Systems and Technologies 2017
DOI: 10.1145/3132465.3133840

Extracting web information using representation patterns

Abstract: Feeding decision support systems with Web information typically requires sifting through an unwieldy amount of information that is available in human-friendly formats only. Our focus is on a scalable proposal to extract information from semi-structured documents in a structured format, with an emphasis on it being scalable and open. By semi-structured we mean that it must focus on information that is rendered using regular formats, not free text; by scalable, we mean that the system must require a minimum amou…

Citations: cited by 3 publications (4 citation statements)
References: 20 publications (10 reference statements)
“…It allows to check them on a collection of well-known datasets and allows to compare the effectiveness results as homogeneously as possible and to rank them as automatically as possible. However, our recent experience with devising new information extractors (Jiménez & Corchuelo, 2016a, 2016b; Jiménez et al, 2020, 2021; Roldán et al, 2017, 2020, 2021) reveals that it can be further improved to take some additional issues into account, namely: (a) whether the validation datasets are completely or partially annotated; (b) whether they contain record values or not and how their structure is taken into account to compute the effectiveness measures; and (c) how the matchings amongst the annotations and the extractions are computed.…”
Section: Related Work (mentioning)
confidence: 99%
“…We experimented with four web information extractors, namely: (a) Wien (Kushmerick et al, 1997), which is a classical supervised proposal that learns the delimiters around the information to be extracted; (b) Tango (Jiménez & Corchuelo, 2016a), which is a recent supervised proposal that learns first-order rules whose predicates are based on visual, structural, user-defined, and content-based features; (c) RoadRunner (Crescenzi et al, 2001), which is a classical unsupervised proposal that attempts to infer the template of several documents by comparing their shared and non-shared tokens; and (d) HotWeb (Roldán et al, 2017), which is a heuristic-based proposal that attempts to identify common visual patterns to present information.…”
Section: Experimental Setting (mentioning)
confidence: 99%
“…Early research, that deals with the automatic analysis of web pages, analyze the pages' DOM (Document Object Model) tree, to extract the HTML (Hypertext Markup Language) tags that potentially contain useful information [28,29]. This process is only possible if the webpage structure is known [30,31]. The problem is therefore the same, since it involves human intervention to analyze the structure of the page, inducing potential errors and a slow indexing process [32].…”
Section: Related Work (mentioning)
confidence: 99%