Data extraction from web pages is the process of analyzing and retrieving relevant data from data sources (usually unstructured or poorly structured) in a specific pattern for further processing; it also involves adding metadata and data integration details for later stages of the data workflow. This survey gives an overview of different web data extraction and data alignment techniques. The extraction techniques covered are DeLa, DEPTA, ViPER, and ViNT. The data alignment techniques are pairwise QRR alignment, holistic alignment, and nested structure processing. Query result pages are generated by a web database in response to a user's query. Automatically extracting the data from these query result pages is very important for many applications, such as data integration, that need to cooperate with multiple web databases. A new method is proposed for data extraction that combines both tag and value similarity. It automatically extracts data from query result pages by first identifying and segmenting the query result records (QRRs) in the query result pages and then aligning the segmented QRRs into a table in which the data values from the same attribute are put into the same column. The data region identification method identifies noncontiguous QRRs that have the same parent according to their tag similarities. Specifically, we propose new techniques to handle the case when the QRRs are not contiguous, which may be due to the presence of auxiliary information, such as a comment, recommendation, or advertisement, and to handle any nested structure that may exist in the QRRs.
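
The two stages described above (identifying noncontiguous QRRs under one parent by tag similarity, then aligning them into a table) can be illustrated with a minimal sketch. This is not the proposed method itself: the function names, the `(tags, values)` record representation, the 0.6 threshold, and the sample data are all assumptions made for illustration. Similarity is computed with Python's standard `difflib.SequenceMatcher`, and the alignment step is simplified to match on tag name only, assuming each tag occurs at most once per record; it does not model value similarity or nested structures.

```python
from difflib import SequenceMatcher


def tag_similarity(tags_a, tags_b):
    """Return a ratio in [0, 1] for how closely two tag sequences match."""
    return SequenceMatcher(None, tags_a, tags_b).ratio()


def identify_records(siblings, threshold=0.6):
    """Pick the largest group of siblings whose tag sequences are similar.

    `siblings` is a list of (tags, values) tuples that share one parent.
    Dissimilar siblings (e.g. an advertisement) are left out, so the
    selected records do not have to be contiguous.
    """
    best = []
    for tags, _ in siblings:
        group = [s for s in siblings if tag_similarity(tags, s[0]) >= threshold]
        if len(group) > len(best):
            best = group
    return best


def align_records(records):
    """Align records into a table with one column per distinct tag.

    Simplification (assumption): a tag appears at most once per record,
    so the tag name alone decides the column. Missing values become None.
    """
    columns = []
    for tags, _ in records:
        for tag in tags:
            if tag not in columns:
                columns.append(tag)
    table = []
    for tags, values in records:
        row = dict(zip(tags, values))
        table.append([row.get(col) for col in columns])
    return columns, table


# Hypothetical children of one parent node; the second entry is an ad.
siblings = [
    (["a", "span", "b"], ["Book A", "$10", "In stock"]),
    (["img"], ["ad banner"]),
    (["a", "span"], ["Book B", "$12"]),
    (["a", "span", "b"], ["Book C", "$9", "Sold out"]),
]

records = identify_records(siblings)
columns, table = align_records(records)
for row in table:
    print(row)
```

In this sketch the advertisement is dropped because its tag sequence shares nothing with the other siblings, and the record with a missing attribute still lands in the right columns, with the absent value left empty.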