Abstract: Biological resources are multifarious, encompassing organisms, genetic materials, populations, and other biotic components of ecosystems, and fine-grained data management and processing of these diverse types of resources poses a tremendous challenge for both researchers and practitioners. Before the conceptualization of data lakes, earlier big data management platforms in the research fields of computational biology and biomedicine could not handle many practical data management tasks well. As an …
“…It first extracts essential information representative of the original raw data, referred to as features, e.g., keywords and named entities. Then it provides services that add synonyms and stems to such features, while it connects them to open knowledge bases such as Google Knowledge Graph [22] and Wikidata [23]. CoreDB also annotates and groups the data sources in the data lake.…”
Section: Semantic Metadata Enrichment
Classification: mentioning (confidence: 99%)
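To make the enrichment steps in the snippet above concrete, the following is a minimal Python sketch (not CoreDB's actual code) of the same kind of pipeline: extract candidate features from raw text, attach stems and WordNet synonyms, and link each feature to an open knowledge base through Wikidata's public entity-search API. The extract_features helper is a toy stand-in for a real keyword/NER extractor, and all function names are illustrative.

# Minimal sketch of feature extraction, synonym/stem enrichment, and
# knowledge-base linking; names and the toy extractor are illustrative.
import requests
from nltk.stem import PorterStemmer
from nltk.corpus import wordnet   # requires nltk.download("wordnet") once

def extract_features(text):
    """Toy feature extractor; a real system would use keyword/NER models."""
    return [tok.strip(".,").lower() for tok in text.split() if len(tok) > 4]

def link_to_wikidata(feature):
    """Look up the feature in Wikidata's public entity-search endpoint."""
    resp = requests.get("https://www.wikidata.org/w/api.php",
                        params={"action": "wbsearchentities", "search": feature,
                                "language": "en", "format": "json"},
                        timeout=10)
    hits = resp.json().get("search", [])
    return hits[0]["id"] if hits else None   # Wikidata QID or None

def enrich_feature(feature):
    stemmer = PorterStemmer()
    synonyms = {lemma.name() for syn in wordnet.synsets(feature)
                for lemma in syn.lemmas()}
    return {"feature": feature,
            "stem": stemmer.stem(feature),
            "synonyms": sorted(synonyms),
            "wikidata": link_to_wikidata(feature)}

if __name__ == "__main__":
    doc = "Genome sequencing datasets ingested into the biological data lake."
    metadata = [enrich_feature(f) for f in extract_features(doc)]
    print(metadata)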
“…In essence, a data lake is a flexible, scalable data storage and management system, which ingests and stores raw data from heterogeneous sources in their original format, and provides maintenance, query processing and data analytics in an on-the-fly manner, with the help of rich metadata [116], [138], [142], [143]. Data lakes are proposed to store and manage data in many real-life use cases: Internet of things (IoT) and smart city [99], manufacturing [112], medicine [42], [55], [114], mobility service (e.g., Uber) [50], biology [23], smart grids [20], [103], air quality control [145], flights data [96], disease control, labor markets and products [13].…”
Data lakes are becoming increasingly prevalent for big data management and data analytics. In contrast to traditional 'schema-on-write' approaches such as data warehouses, data lakes are repositories that store raw data in its original format and provide a common access interface. Despite strong interest from both academia and industry, considerable ambiguity remains regarding the definition, functions and available technologies of data lakes. A complete, coherent picture of data lake challenges and solutions is still missing. This survey reviews the development, architectures, and systems of data lakes. We provide a comprehensive overview of research questions for designing and building data lakes. We classify the existing approaches and systems based on the functions they provide for data lakes, which makes this survey a useful technical reference for designing, implementing and deploying data lakes. We hope that the thorough comparison of existing solutions and the discussion of open research challenges in this survey will motivate the future development of data lake research and practice.
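The schema-on-read contrast described above can be illustrated with a small sketch: raw files are copied into the lake unchanged, and only lightweight metadata (source, format, size, ingestion time, tags) is written to a catalog, so parsing is deferred to query time. All paths and field names below (lake/raw, catalog.json) are illustrative and not taken from any of the cited systems.

# Minimal sketch of schema-on-read ingestion: files are stored as-is and
# described by a small metadata catalog used later for discovery.
import json, shutil, hashlib
from datetime import datetime, timezone
from pathlib import Path

LAKE_DIR = Path("lake/raw")
CATALOG = Path("lake/catalog.json")

def ingest(source_path, tags=()):
    """Copy a file into the lake in its original format and record metadata."""
    LAKE_DIR.mkdir(parents=True, exist_ok=True)
    src = Path(source_path)
    digest = hashlib.sha256(src.read_bytes()).hexdigest()[:12]
    target = LAKE_DIR / f"{digest}_{src.name}"       # original format preserved
    shutil.copy2(src, target)

    record = {
        "path": str(target),
        "source": str(src),
        "format": src.suffix.lstrip("."),            # csv, json, fastq, ...
        "size_bytes": target.stat().st_size,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "tags": list(tags),
    }
    catalog = json.loads(CATALOG.read_text()) if CATALOG.exists() else []
    catalog.append(record)
    CATALOG.write_text(json.dumps(catalog, indent=2))
    return record

def find(tag):
    """Schema-on-read discovery: filter the catalog; parse files only on demand."""
    catalog = json.loads(CATALOG.read_text()) if CATALOG.exists() else []
    return [r for r in catalog if tag in r["tags"]]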
“…This method, originally proposed for the management of large transactional datasets (‘big data’), has become a generalized solution for management of heterogeneous data that offers benefits such as cost-effectiveness, high scalability, data fidelity, real-time data ingestion and fault tolerance [51]. Mature examples of this method implement tiered access layers to ensure that sensitive participant data is protected [52]. The use of a semi-structured data storage approach also facilitates the iterative development and application of rules-based and inference-based participant selection methods.…”
Section: Building Future Trial-ready Cohorts
Classification: mentioning
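The snippet above mentions semi-structured participant records, tiered access layers, and rules-based selection; the sketch below illustrates those three ideas in a few lines of Python. Every field name, access tier and threshold here is a placeholder invented for illustration, not a criterion from any actual trial or cohort.

# Minimal sketch: semi-structured records, a tiered access view, and a
# rules-based pre-screening pass. All fields and thresholds are placeholders.
from typing import Callable

participants = [  # semi-structured: records need not share the same fields
    {"id": "p001", "age": 68, "consented": True, "amyloid_pet": "pending"},
    {"id": "p002", "age": 59, "consented": True, "cognitive_score": 29},
    {"id": "p003", "age": 72, "consented": False},
]

# Tiered access: only non-identifying fields are exposed at the "open" tier.
TIERS = {"open": {"id", "age"}, "restricted": None}  # None = all fields

def view(record, tier="open"):
    allowed = TIERS[tier]
    return record if allowed is None else {k: v for k, v in record.items() if k in allowed}

# Rules are plain predicates, so new criteria can be added iteratively.
rules: list[Callable[[dict], bool]] = [
    lambda r: r.get("consented", False),
    lambda r: 60 <= r.get("age", 0) <= 85,           # placeholder age window
]

def preselect(records):
    return [r["id"] for r in records if all(rule(r) for rule in rules)]

print(preselect(participants))          # -> ['p001']
print(view(participants[0], "open"))    # identifiers withheld at the open tier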
Slowing the progression of Alzheimer disease (AD) might be the greatest unmet medical need of our time. Although one AD therapeutic has received a controversial accelerated approval from the FDA, more effective and accessible therapies are urgently needed. Consensus is growing that for meaningful disease modification in AD, therapeutic intervention must be initiated at very early (preclinical or prodromal) stages of the disease. Although the methods for such early-stage clinical trials have been developed, identification and recruitment of the required asymptomatic or minimally symptomatic study participants takes many years and requires substantial funds. As an example, in the Anti-Amyloid Treatment in Asymptomatic Alzheimer’s Disease Trial (the first phase III trial to be performed in preclinical AD), 3.5 years and more than 5,900 screens were required to recruit and randomize 1,169 participants. A new clinical trials infrastructure is required to increase the efficiency of recruitment and accelerate therapeutic progress. Collaborations in North America, Europe and Asia are now addressing this need by establishing trial-ready cohorts of individuals with preclinical and prodromal AD. These collaborations are employing innovative methods to engage the target population, assess risk of brain amyloid accumulation, select participants for biomarker studies and determine eligibility for trials. In the future, these programmes could provide effective tools for pursuing the primary prevention of AD. Here, we review the lessons learned from the AD trial-ready cohorts that have been established to date, with the aim of informing ongoing and future efforts towards efficient, cost-effective trial recruitment.
“…Even when standards are adopted, the standardized structured metadata is often unexposed and not reusable. The proliferation and fragmentation of incomplete data repositories, the lack of organization of data in endless Data Lakes [32] or in repositories with insufficient metadata, and the lack of common metadata standards make it difficult to combine separate data resources into a single searchable index. While standardizing metadata will not be sufficient to fully combine research data and code from different sources and enable meta-analyses, it is nevertheless a crucial first step towards this goal.…”
Biomedical datasets are increasing in size, are stored in many repositories, and face challenges in FAIRness (findability, accessibility, interoperability, reusability). As a Consortium of infectious disease researchers from 15 Centers, we aim to adopt Open Science practices to promote transparency, encourage reproducibility, and accelerate research advances through data reuse. To improve the FAIRness of our datasets and computational tools, we evaluated metadata standards across established biomedical data repositories. The vast majority do not adhere to a single standard, such as Schema.org, which is widely adopted by generalist repositories. Consequently, datasets in these repositories are not findable in aggregation projects like Google Dataset Search. We alleviated this gap by creating a reusable metadata schema based on Schema.org and cataloguing nearly 400 datasets and computational tools that we collected. The approach is easily reusable to create schemas that are interoperable with community standards but customized to a particular context. Our approach enabled data discovery, increased the reusability of datasets from a large research consortium, and accelerated research. Lastly, we discuss ongoing challenges with FAIRness beyond discoverability.
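The consortium's approach centres on Schema.org-based metadata, so a minimal example of the kind of record involved may help: a schema.org Dataset description serialized as JSON-LD, the format that aggregators such as Google Dataset Search index. The values below are invented placeholders, not an entry from the consortium's actual catalogue.

# Minimal sketch of a Schema.org "Dataset" record built and serialized in Python.
import json

dataset_record = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Example infectious-disease surveillance dataset",    # placeholder
    "description": "Illustrative record showing commonly used Dataset fields.",
    "identifier": "https://doi.org/10.0000/example",               # placeholder DOI
    "keywords": ["infectious disease", "surveillance", "FAIR"],
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "creator": {"@type": "Organization", "name": "Example Research Center"},
    "distribution": {
        "@type": "DataDownload",
        "encodingFormat": "text/csv",
        "contentUrl": "https://example.org/data/example.csv",      # placeholder
    },
}

# Embedded in a dataset landing page inside a <script type="application/ld+json">
# element, JSON-LD like this is what dataset search engines crawl and index.
print(json.dumps(dataset_record, indent=2))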