Metadata Extraction and Management in Data Lakes With GEMMS

Quix, Christoph; Hai, Rihan; Vatov, Ivan

doi:10.7250/csimq.2016-9.04

Cited by 45 publications

(43 citation statements)

References 19 publications

(25 reference statements)


“…Thus, to ensure data accessibility, exploration, and exploitation, an efficient and effective metadata system becomes an indispensable component in data lakes (Quix et al., 2016). Yet most of the research work on data lakes still concentrates on structured or semi-structured data only (Farid et al., 2016; Farrugia et al., 2016; Madera and Laurent, 2016; Quix et al., 2016; Klettke et al., 2017). So far, unstructured data have not received enough consideration in the relevant research literature, even though unstructured heterogeneous data occur frequently (Miloslavskaya and Tolstoy, 2016).…”

Section: Related Work (mentioning)

confidence: 99%

On the Logical Design of a Prototypical Data Lake System for Biological Resources

Che

Duan

2020

Front. Bioeng. Biotechnol.


Biological resources are multifarious, encompassing organisms, genetic materials, populations, or any other biotic components of ecosystems, and fine-grained data management and processing of these diverse types of resources poses a tremendous challenge for both researchers and practitioners. Before the conceptualization of data lakes, former big data management platforms in the research fields of computational biology and biomedicine could not deal with many practical data management tasks very well. As an effective complement to those previous systems, data lakes were devised to store voluminous, varied, and diversely structured or unstructured data in their native formats, for the sake of various analyses like reporting, modeling, data exploration, knowledge discovery, data visualization, advanced analysis, and machine learning. Due to their intrinsic traits, data lakes are thought to be ideal technologies for processing hybrid biological resources in the format of text, image, audio, video, and structured tabular data. This paper proposes a method for constructing a practical data lake system for processing multimodal biological data using a prototype system named ProtoDLS, especially from the explainability point of view, which is indispensable to the rigor, transparency, persuasiveness, and trustworthiness of the applications in the field. ProtoDLS adopts a horizontal pipeline to ensure the intra-component explainability factors from data acquisition to data presentation, and a vertical pipeline to ensure the inner-component explainability factors including mathematics, algorithm, execution time, memory consumption, network latency, security, and sampling size. The dual mechanism can ensure the explainability guarantees on the entirety of the data lake system. ProtoDLS proves that a single point of explainability cannot thoroughly expound the cause and effect of the matter from an overall perspective, and adopting a systematic, dynamic, and multisided way of thinking and a system-oriented analysis method is critical when designing a data processing system for biological resources.


“…We only have partial solutions in the literature. Some works concentrate on the detection of relationships between different datasets [1,9,27]. Other works focus on the extraction of metadata for unstructured data (mostly textual data) [27,29].…”

Section: Metadata (mentioning)

confidence: 99%

Data Lakes: Trends and Perspectives

Ravat

Yan

2019

Lecture Notes in Computer Science


As a relatively new concept, the data lake has neither a standard definition nor an acknowledged architecture. Thus, we study the existing work and propose a complete definition and a generic and extensible architecture of data lakes. Moreover, we introduce three future research axes in connection with our health-care Information Technology (IT) activities. They are related to (i) metadata management that consists of intra- and inter-metadata, (ii) a unified ecosystem for companies' data warehouses and data lakes, and (iii) data lake governance.


“…Such data-intensive processing environments are hard to manage [9], [78] as the data lifecycle inside them is so complicated. Given a data product, tracing its sources and finding all the processing steps applied to it is challenging.…”

Section: Introduction (mentioning)

confidence: 99%

Big Provenance Stream Processing for Data Intensive Computations

Suriarachchi

Withana

Plale

2018

2018 IEEE 14th International Conference on E-Science (E-Science)


This dissertation is a result of an effort over many years. There are so many people who helped me in various ways during this endeavor. Without their generous support and encouragement, this work would not have been possible. First of all, I am so grateful to my Ph.D. advisor, Prof. Beth Plale for her invaluable support, guidance, and encouragement throughout my Ph.D. Her research experience over many years across multiple areas of Computer Science helped me in many ways to solve hard research problems and to successfully present them as publications. In addition to that, she was so kind to me and my family during our hard times. I am truly honored to have worked with her throughout my Ph.D. studies. I would like to thank my research committee members Prof. David Leake, Prof. Ryan Newton and Prof. Judy Qiu for their guidance and advice on my qualifying exams, thesis proposal, and final dissertation. I should thank all professors at the School of Informatics, Computing, and Engineering from whom I took a number of courses which helped immensely to improve my knowledge and skills.

