Persistent Data Staging Services for Data Intensive In-situ Scientific Workflows

Romanus, Melissa; Zhang, Fan; Jin, Tong; Sun, Qian; Bui, Hoang; Parashar, Manish; Choi, Jong Youl; Janhunen, Saloman; Hager, Robert; Klasky, Scott; Chang, C. S.; Rodero, Iván

doi:10.1145/2912152.2912157

Cited by 8 publications

(2 citation statements)

References 23 publications

“…Programming models: Existing big data toolkits (e.g., Hadoop [12], Spark [28], AllPairs [19], DataSpaces [25], etc) already provide an extensive collection of ready-to-use functionalities. It is critical that students understand the underlying programming paradigms implemented in these functionalities.…”

Section: Learning Activities (mentioning)

confidence: 99%

Sustainable and Scalable Setup for Teaching Big Data Computing

Ngo, Bui

2023

JOCSE


As more students want to pursue a career in big data analytics and data science, big data education has become a focal point in many colleges' and universities' curricula. There are many challenges when it comes to teaching and learning big data in a classroom setting. One of the biggest is preparing big data infrastructure that provides meaningful hands-on experience to students. Setting up the necessary distributed computing resources is a delicate act for instructors and system administrators because there is no one-size-fits-all solution. In this paper, we propose an approach that facilitates the creation of the computing environment on both personal computers and public cloud resources. This combined approach meets different needs and can be used in an educational setting to facilitate different big data learning activities. We discuss and reflect on our experience using these systems in teaching undergraduate and graduate courses.

“…A common practice is to use a separate or external parallel computer to prepare data for subsequent processing, but this strategy not only limits the amount of data that can be saved, but also turns I/O into a performance bottleneck when using a large parallel system. The most plausible solution for the exascale data problem is to reduce or transform the data in-situ [17] to perform subsequent processing locally or even while it is being generated.…”

Section: In-situ Processing (mentioning)

confidence: 99%

dataClay: next generation object storage

Martí Fraiz


Existing solutions for data sharing are not fully compatible with multi-provider contexts. Traditionally, providers offer their datasets through hermetic Data Services with restricted APIs. Consumers are therefore compelled to adapt their applications to the current functionality, and their chances of contributing their own know-how are very limited. With regard to data management, the database management systems (DBMSs) that sustain these Data Services are designed for a single-provider scenario, forcing a centralized administration conducted by the single role of the database administrator (DBA). The DBA defines the conceptual schema and the corresponding integrity constraints, and determines the external schema to be offered to end users. The problem is that a multi-provider environment cannot assume the existence of a central role for the administration of all the datasets. In terms of data processing, the different representations of the data model at different tiers, from the application level to the Data Service or DBMS layers, cause applications to dedicate between 20% and 50% of their code to performing the proper transformations. This has a negative impact both on developers' productivity and on the global performance of data-intensive workflows. In light of the foregoing, this thesis proposes three novel techniques that enable a data store to support a multi-provider ecosystem, facilitating collaboration among all the players and the development of efficient data-intensive applications. In particular, after the convenient decentralization of the database administration, this thesis contributes to the community: 1) the proper mechanisms to enable consumers to extend the current schema and functionality without compromising providers' constraints; 2) the proper mechanisms to enable any provider to define its own policies and integrity constraints in a way that will never be jeopardized; 3) the integration of a parallel programming model with the data model to drastically reduce data transformations, designed to be compliant with near-future storage devices. These contributions have been validated by means of the design and implementation of dataClay, as an example of a multi-provider data store that fulfills the defined requirements. Furthermore, regarding the first and third contributions, several performance analyses are presented to evaluate and prove their feasibility (note that the second contribution is merely logical).

Current solutions for sharing data are not compatible with multi-provider contexts. Traditionally, data providers offer their data through hermetic Data Services with very restricted APIs. As a result, consumers are on the one hand forced to adapt their applications to the current functionality, and on the other hand see their chances of contributing their own know-how severely limited. In terms of management, the database management systems that sustain these Data Services are designed for scenarios with a single provider, forcing a centralized administration that falls to the role of the database administrator, or DBA. The DBA defines the necessary integrity constraints and specifies the external model of the data to be offered to users. The problem is that in a multi-provider environment, we cannot assume the existence of a single central administrator who takes care of everyone's data. In terms of processing, having different representations of the data depending on whether they are processed at the application, service, or database level means that applications must dedicate between 20% and 50% of their code to performing the corresponding transformations. This has a negative impact both on programmer productivity and on the overall performance of data-intensive applications.

In view of these difficulties, this thesis proposes three new mechanisms that make it possible for a data management system to support multi-provider environments, facilitating collaboration with consumers and the development of data-intensive applications. Specifically, starting from the decentralization of data administration and an object-oriented data model, this thesis contributes to the scientific community: 1) a mechanism that allows consumers to extend the external data model and the functionality offered without compromising providers' constraints; 2) a mechanism that allows each provider to define whatever integrity constraints it deems appropriate on the data model, such that they are always respected regardless of how the data is used and what extensions exist; 3) the integration of a parallel programming model with the data model to improve application performance and programmer productivity, significantly reducing data transformations and the code needed to access the data. These contributions are validated through the design and implementation of dataClay, as an example of a multi-provider data manager that meets the defined requirements. In addition, for the first and third contributions, a series of performance studies is presented that evaluate and demonstrate their feasibility (the second contribution is purely logical).
