JSON Schema Inference Approaches

The popularity of JSON as a data interchange format has resulted in a large number of datasets available for processing. Users would like to analyze this data using SQL queries, but existing distributed systems limit their users to only two specific formats, JSONLine and GeoJSON. The complexity of JSON schema makes it challenging to parse arbitrary files in a modern distributed system while producing records with a unified schema that can be processed with SQL. To address these challenges, this paper introduces dsJSON, a state-of-the-art distributed JSON processor that overcomes limitations in existing systems and scales to big and complex data. dsJSON introduces the projection tree, a novel data structure that applies selective parsing of nested attributes to produce records that are ready for SQL processors. The key objective of the projection tree is to parse a big JSON file in parallel to produce records with a unified schema that can be processed with SQL. dsJSON is integrated into SparkSQL, which enables users to run arbitrary SQL queries on complex JSON files. It also pushes projection and filter down into the parser for full integration between the parser and the processor. Experiments on up to two terabytes of real data show that dsJSON performs several times faster than existing systems. It can also efficiently parse extremely large files not supported by existing distributed parsers.
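The idea of selective extraction described in the abstract above can be illustrated with a small Python sketch. This is not dsJSON's implementation: the function names `build_projection_tree` and `project` are hypothetical, and the sketch filters records that `json.loads` has already parsed in full, whereas the abstract describes applying the projection during parsing. It only shows how a tree of wanted attributes yields flat records that all share the same columns.

```python
import json


def build_projection_tree(paths):
    """Turn dotted paths such as 'user.name' into a nested dict (the tree)."""
    tree = {}
    for path in paths:
        node = tree
        for key in path.split("."):
            node = node.setdefault(key, {})
    return tree


def project(value, tree, prefix=""):
    """Visit only the branches named in the tree and return a flat record.

    Attributes missing from the input are emitted as None, so every
    record ends up with the same set of columns.
    """
    record = {}
    for key, subtree in tree.items():
        column = prefix + key
        child = value.get(key) if isinstance(value, dict) else None
        if subtree:  # inner node: descend into the nested object
            record.update(project(child, subtree, column + "."))
        else:  # leaf: emit the value as a column
            record[column] = child
    return record


# Hypothetical input, one JSON object per line.
lines = [
    '{"id": 1, "user": {"name": "a", "age": 30}, "tags": ["x"]}',
    '{"id": 2, "user": {"name": "b"}}',
    '{"id": 3}',
]
tree = build_projection_tree(["id", "user.name"])
rows = [project(json.loads(line), tree) for line in lines]
```

Each entry in `rows` has exactly the columns `id` and `user.name`, which is the shape a SQL engine expects from a table.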

dsJSON: A Distributed SQL JSON Processor

Saeedan

Eldawy

Zhao

2023

Proc. ACM Manag. Data

A universal approach for multi-model schema inference

2022

The variety feature of Big Data, represented by multi-model data, has brought a new dimension of complexity to all aspects of data management. The need to process a set of distinct but interlinked data models is a challenging task. In this paper, we focus on the problem of inference of a schema, i.e., the description of the structure of data. While several verified approaches exist in the single-model world, their application for multi-model data is not straightforward. We introduce an approach that ensures inference of a common schema of multi-model data capturing their specifics. It can infer local integrity constraints as well as intra- and inter-model references. Following the standard features of Big Data, it can cope with overlapping models, i.e., data redundancy, and it is designed to process efficiently significant amounts of data. To the best of our knowledge, ours is the first approach addressing schema inference in the world of multi-model databases.
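For orientation, single-model schema inference over JSON documents can be sketched in a few lines of Python. This is a generic illustration, not the approach proposed in the paper above: it handles one model only, infers neither integrity constraints nor references, and the names `type_name`, `collect` and `infer_schema` are hypothetical.

```python
from collections import defaultdict


def type_name(value):
    """Map a parsed JSON value to a JSON type name."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # checked before int: bool is a subclass of int
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "array"
    return "object"


def collect(value, prefix, seen):
    """Record the type observed at each dotted path of one document."""
    for key, child in value.items():
        path = prefix + key
        seen[path].add(type_name(child))
        if isinstance(child, dict):
            collect(child, path + ".", seen)


def infer_schema(documents):
    """Merge per-document observations into one schema.

    A path is 'required' only if it occurs in every document of the
    collection (not merely in every parent object that contains it).
    """
    types = defaultdict(set)
    counts = defaultdict(int)
    for doc in documents:
        seen = defaultdict(set)
        collect(doc, "", seen)
        for path, names in seen.items():
            types[path] |= names
            counts[path] += 1
    total = len(documents)
    return {
        path: {"types": sorted(types[path]), "required": counts[path] == total}
        for path in types
    }


# Hypothetical documents with a type conflict on "id" and optional fields.
docs = [
    {"id": 1, "user": {"name": "a"}},
    {"id": "2", "user": {"name": "b", "age": 3}},
    {"id": 3},
]
schema = infer_schema(docs)
```

Here `schema["id"]` records both `number` and `string` as observed types, while `user.age` is marked as not required because it appears in only one document.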

Schema Extraction in NoSQL Databases: A Systematic Literature Review

Belefqih,

Zellou,

Berquedich

2024

RACSC

Introduction: Nowadays, NoSQL databases have taken on an increasingly important role in the storage of massive data within companies. Due to a common property called schema-less, NoSQL databases offer great flexibility, particularly for the storage of data in different formats. However, despite their success in data storage, schema-less databases are a major obstacle in areas requiring precise knowledge of this schema, especially in the field of data integration. Method: This study presents a Systematic Literature Review (SLR) to explore, evaluate, and discuss relevant existing research and endeavors using novel schema extraction approaches. Furthermore, we conducted this study using a well-defined methodology to examine and study the problem of schema extraction from NoSQL databases. Results: Our research results highlight and emphasize the schema extraction approaches and provide knowledge to researchers and practitioners by proposing schema extraction approaches and their limitations, which contributes to inventing new, more efficient approaches. Conclusion: In our future work, inspired by the recent advances in quantum computing and the emergence of post-quantum cryptography (PQC), we aim to propose a schema extraction approach that blends cutting-edge technologies with a strong focus on database security.

Cited by 3 publications

References 14 publications
