We introduce the +Spicy mapping system. The system is based on a number of novel algorithms that increase the quality and expressiveness of mappings. +Spicy integrates the computation of core solutions into the mapping generation process in a highly efficient way, based on a natural rewriting of the given mappings. This allows for an efficient implementation of core computations using common runtime languages like SQL or XQuery and guarantees very good performance, orders of magnitude better than that of previous algorithms. The rewriting algorithm can be applied both to mappings generated by the system and to pre-defined mappings provided as part of the input. To do this, the system was enriched with a set of expressive primitives, so that +Spicy is the first mapping system that brings together a sophisticated and expressive mapping generation algorithm with an efficient strategy to compute core solutions.
Automated web scraping is a popular means for acquiring data from the web. Scrapers (or wrappers) are derived from either manually or automatically annotated examples, often resulting in under- or over-segmented data, together with missing or spurious content. Automatic repair and maintenance of the extracted data is thus a necessary complement to automatic wrapper generation. Moreover, the extracted data is often the result of a long-term data acquisition effort, and thus jointly repairing wrappers together with the generated data reduces future needs for data cleaning. We study the problem of computing joint repairs for XPath-based wrappers and their extracted data. We show that the problem is NP-complete in general but becomes tractable under a few natural assumptions. Even tractable solutions to the problem are still impractical on very large datasets, but we propose an optimal approximation that proves effective across a wide variety of domains and sources. Our approach relies on encoded domain knowledge but requires no per-source supervision. An evaluation spanning more than 100k web pages from 100 different sites across a wide variety of application domains shows that joint repairs increase the quality of wrappers by between 15% and 60% independently of the wrapper generation system, eliminating all errors in more than 50% of the cases.
Web scraping (or wrapping) is a popular means for acquiring data from the web. Recent advancements have made scalable wrapper generation possible and enabled data acquisition processes involving thousands of sources. This makes wrapper analysis and maintenance both necessary and challenging, as no scalable tools exist that support these tasks. We demonstrate WADaR, a scalable and highly automated tool for joint wrapper and data repair. WADaR uses off-the-shelf entity recognisers to locate target entities in wrapper-generated data. Markov chains are used to determine structural repairs, which are then encoded into suitable repairs for both the data and the corresponding wrappers. We show that WADaR is able to increase the quality of wrapper-generated relations by between 15% and 60%, and to fully repair the corresponding wrapper without any knowledge of the original website in more than 50% of the cases.