Abstract: There is currently a burst of Big Data (BD) processed and stored in huge raw data repositories, commonly called Data Lakes (DL). These BD require new techniques of data integration and schema alignment in order to make the data usable by its consumers and to discover the relationships linking their content. This can be provided by metadata services which discover and describe their content. However, there is currently no systematic approach for this kind of metadata discovery and management. Thus, we propose a framework for the profiling of informational content stored in the DL, which we call information profiling. The profiles are stored as metadata to support data analysis. We formally define a metadata management process which identifies the key activities required to handle this effectively. We demonstrate the alternative techniques and performance of our process using a prototype implementation handling a real-life case study from the OpenML DL, which showcases the value and feasibility of our approach.

I. INTRODUCTION

There is currently a huge growth in the amount, variety, and velocity of data ingested in analytical data repositories. Such data are commonly called Big Data (BD). Data repositories storing such BD in their original raw format are commonly called Data Lakes (DL) [1]. DL are characterised by having a large amount of data covering different subjects, which need to be analysed by non-experts in IT, commonly called data enthusiasts [2]. To support the data enthusiast in analysing the data in the DL, there must be a data governance process which describes the content using metadata. Such a process should describe the informational content of the data ingested using the least intrusive techniques.
The metadata can then be exploited by the data enthusiast to discover relationships between datasets, duplicated data, and outliers which have no other datasets related to them.

In this paper, we investigate the appropriate process and techniques required to manage the metadata about the informational content of the DL. We specifically focus on addressing the challenges of variety and variability of BD ingested in the DL. The metadata discovered supports data consumers in finding the required data in the large amounts of information stored inside the DL for analytical purposes [3]. Currently, information discovery to identify, locate, integrate and reengineer data consumes 70% of the time spent in data analytics projects [1], which clearly needs to be decreased. To handle this challenge, this paper proposes (i) a systematic process for the schema annotation of data ingested in the DL and
With the arrival of Data Lakes (DL) there is an increasing need for efficient dataset classification to support data analysis and information retrieval. Our goal is to use meta-features describing datasets to detect whether they are similar. We utilise a novel proximity mining approach to assess the similarity of datasets. The proximity scores are used as an efficient first step, where pairs of datasets with high proximity are selected for further time-consuming schema matching and deduplication. The proposed approach helps in early-pruning unnecessary computations, thus improving the efficiency of similar-schema search. We evaluate our approach in experiments using the OpenML online DL, which shows significant efficiency gains above 25% compared to matching without early-pruning, and recall rates reaching higher than 90% under certain scenarios.
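The early-pruning idea described above can be illustrated with a minimal sketch. It assumes each dataset is summarised by a numeric meta-feature vector and uses plain cosine similarity as a stand-in proximity score; the function names, the threshold, and the similarity measure are illustrative assumptions and not the proximity mining method evaluated in the paper.

```python
from itertools import combinations
from math import sqrt


def cosine_proximity(a, b):
    """Cosine similarity between two meta-feature vectors.

    Assumption: a simple stand-in for the proximity score, chosen only
    for illustration.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def candidate_pairs(meta_features, threshold):
    """Keep only dataset pairs whose proximity reaches the threshold.

    meta_features: dict mapping a dataset name to its meta-feature vector.
    threshold: hypothetical cut-off; pairs below it are pruned and never
    reach the expensive schema matching step.
    """
    names = sorted(meta_features)
    return [
        (a, b)
        for a, b in combinations(names, 2)
        if cosine_proximity(meta_features[a], meta_features[b]) >= threshold
    ]
```

Only the pairs returned by `candidate_pairs` would then be passed to schema matching and deduplication. Raising the threshold prunes more pairs and saves more computation, at the risk of discarding truly similar pairs and lowering recall.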
There is currently plenty of research concerning the effect of Enterprise Resource Planning (ERP) systems on business performance. Previous research has shown a mixed relationship between ERP and business performance: some studies suggested that ERP improves performance, while others found that it does not. Previous research was mainly based on quantitative methods, which do not provide detailed, insightful knowledge. A case study, on the other hand, can help identify the important contributing factors in the relationship between ERP and business performance. This paper therefore investigates this topic by analyzing a critical case study of an Egyptian SME branch of a multinational company. The results indicate that, in general, many benefits in business performance were achieved after implementing the ERP, as reported by the business users, but also that a few benefits previously linked to ERP were not fully achieved. This indicates the positive contribution of ERP to business performance, but also suggests that the applicability of this positive relationship is limited by specific factors that remain to be researched.
With the growth of the number of datasets stored in data repositories, there has been a trend of using Data Lakes (DLs) to store such data. DLs store datasets in their raw formats without any transformations or preprocessing, with accessibility available using schema-on-read. This makes it difficult for analysts to find datasets that can be crossed and that belong to the same topic. To support them in this DL governance challenge, we propose in this paper an algorithm for categorizing datasets in the DL into pre-defined topic-wise categories of interest. We utilise a k-NN approach for this task which uses a proximity score for computing similarities of datasets based on metadata. We test our algorithm on a real-life DL with a known ground-truth categorization. Our approach is successful in detecting the correct categories for datasets and outliers with a precision of more than 90% and recall rates exceeding 75% in specific settings.
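A k-NN categorization of this kind can be sketched as follows. This is a minimal illustration under stated assumptions: metadata is reduced to a numeric vector, the proximity score is derived from Euclidean distance, and `k` and `min_proximity` are hypothetical parameters. None of these choices are taken from the paper.

```python
from collections import Counter
from math import sqrt


def distance_proximity(a, b):
    """Map Euclidean distance into (0, 1]; identical vectors score 1.0.

    Assumption: an illustrative stand-in for the metadata-based
    proximity score.
    """
    dist = sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return 1.0 / (1.0 + dist)


def categorize(new_vec, labelled, k=3, min_proximity=0.5):
    """Assign a category by majority vote among the k closest datasets.

    labelled: list of (meta_feature_vector, category) pairs.
    Returns None when no neighbour reaches min_proximity, which marks
    the new dataset as an outlier with no related category.
    """
    scored = sorted(
        ((distance_proximity(new_vec, vec), cat) for vec, cat in labelled),
        key=lambda pair: pair[0],
        reverse=True,
    )
    neighbours = [(score, cat) for score, cat in scored[:k] if score >= min_proximity]
    if not neighbours:
        return None
    # Counter keeps first-seen order, so a tied vote goes to the category
    # of the closest neighbour.
    votes = Counter(cat for _, cat in neighbours)
    return votes.most_common(1)[0][0]
```

In this sketch the proximity cut-off does double duty: it filters weak neighbours out of the vote and it flags outliers. A stricter cut-off would tend to favour precision over recall.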