BlobSeer: Next-generation data management for large scale infrastructures

Nicolae, Bogdan; Antoniu, Gabriel; Bougé, Luc; Moise, Diana; Carpen-Amarie, Alexandra

doi:10.1016/j.jpdc.2010.08.004

Cited by 106 publications

(74 citation statements)

References 26 publications

“…Finally we analyze the data storage solution provided by BlobSeer [14]. This solution represents data as BLOBS taking into consideration that most data in circulation is unstructured.…”

Section: Data Storage and Aggregation Solution (mentioning)

confidence: 99%

“…The service is designed to respect all the requirements and constraints imposed by data-intensive applications and utilizes multiple features of BlobSeer [14] such as data stripping, distributed metadata management and versioning-based concurrency control. The DDAS is designed to ensure scalability, fault tolerance and data retrieval performance [16].…”

Section: Introduction (mentioning)

confidence: 99%

A formal method for rule analysis and validation in distributed data aggregation service

Serbanescu, Pop, Cristea et al. 2015

World Wide Web

Self Cite

The usage of Cloud Services has increased rapidly in recent years. Data management systems, behind any Cloud Service, are a major concern when it comes to scalability, flexibility and reliability due to being implemented in a distributed way. A Distributed Data Aggregation Service relying on a storage system meets these demands and serves as a repository back-end for complex analysis and automatic mining of any type of data. In this paper we continue our previous work on data management in Cloud storage. We present a formal approach to express retrieval and aggregation rules with a compact, yet powerful tool called Rule Markup Language. Our extended solution proposes a standard form to schemes and uses the tool to match the rules to the XML form of the structured data in order to obtain the unstructured entries from the BlobSeer data storage system. This allows the Distributed Data Aggregation Service (DDAS) to bypass several steps when processing a retrieval request. Our new architecture is more loosely coupled, with a separate module, the new tool, used for transforming the XML entries to standard XML files which represent the final result. We model the dynamic behavior of the system using this new standard to ensure a simpler and more efficient representation of the operations performed by the client while maintaining the constraints imposed by a distributed system running in the Cloud. Furthermore, we prove that this method correctly performs the translation between the storage model's unstructured view of data and the client's structured objects.


“…PVFS [23]) or cloud storage repositories (e.g. Amazon S3 [24]) to specialized storage systems [25] and even local storage. Local storage is particularly attractive, because it is much faster and more scalable compared to conventional approaches.…”

Section: B Architecture (mentioning)

confidence: 99%

Towards Scalable Checkpoint Restart: A Collective Inline Memory Contents Deduplication Proposal

Nicolae 2013

2013 IEEE 27th International Symposium on Parallel and Distributed Processing

Self Cite

Abstract: With increasing scale and complexity of supercomputing and cloud computing architectures, faults are becoming a frequent occurrence. For a large class of applications that run for a long time and are tightly coupled, Checkpoint-Restart (CR) is the only feasible method to survive failures. However, exploding checkpoint sizes that need to be dumped to storage pose a major scalability challenge, prompting the need to reduce the amount of checkpointing data. This paper contributes a novel collective memory contents deduplication scheme that attempts to identify and eliminate duplicate memory pages before they are saved to storage. Unlike previous approaches that concentrate on the checkpoints of the same process, our approach identifies duplicate memory pages shared by different processes (regardless of whether they are on the same or different nodes). We show both how to achieve such a global deduplication in a scalable fashion and how to leverage it effectively to optimize the data layout in such a way that it minimizes I/O bottlenecks. Large scale experiments show significant reduction of storage space consumption and performance overhead compared to several state-of-the-art approaches, both in synthetic benchmarks and for a real life high performance computing application.

“…Our approach is based on shadowing techniques [15], which means to offer the illusion of creating a new standalone snapshot of the file for each update to it but to physically store only the differences and manipulate metadata in such way that the aforementioned illusion is upheld. Starting from the principles introduced in [16], we propose to enable concurrent MPI processes to write their non-contiguous regions in complete isolation, without having to care about overlappings and synchronization, which is made possible by keeping data immutable: new differences are added rather than modify an existing snapshot. It is at the metadata level where the ordering is done and the overlappings are resolved in such way as to expose a snapshot of the file that looks as if all differences were applied in an arbitrary sequential order.…”

Section: Design Principles (mentioning)

confidence: 99%

Efficient Support for MPI-I/O Atomicity Based on Versioning

Tran, Nicolae, Antoniu et al. 2011

2011 11th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing

Self Cite

Abstract: We consider the challenge of building data management systems that meet an important requirement of today's data-intensive HPC applications: to provide a high I/O throughput while supporting highly concurrent data accesses. In this context, many applications rely on MPI-IO and require atomic, non-contiguous I/O operations that concurrently access shared data. In most existing implementations the atomicity requirement is implemented through locking-based schemes, which have proven inefficient, especially for non-contiguous I/O. We claim that a versioning-enabled storage backend has the potential to avoid the expensive synchronization exhibited by locking-based schemes and is therefore much more efficient. We describe a prototype implementation on top of ROMIO along this idea, and report on promising experimental results with standard MPI-IO benchmarks specifically designed to evaluate the performance of non-contiguous, overlapped I/O accesses under MPI atomicity guarantees.
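
The shadowing idea quoted in the citation statement for this paper (store only immutable differences, and resolve ordering and overlaps at the metadata level) can be illustrated with a small toy model. The sketch below is not BlobSeer's or ROMIO's actual API or metadata scheme: the class `VersionedFile` and its methods `write` and `snapshot` are hypothetical names chosen for illustration, and the in-memory list stands in for what would be distributed metadata in a real system.

```python
from threading import Lock


class VersionedFile:
    """Toy model (hypothetical): writes are stored as immutable differences,
    and a snapshot is whatever results from replaying them in version order."""

    def __init__(self):
        self._diffs = []      # append-only list of differences, never modified
        self._lock = Lock()   # guards only the (cheap) version assignment

    def write(self, regions):
        """Record one atomic write made of non-contiguous (offset, data) regions.

        Returns the version number of the snapshot this write produces.
        """
        # Copy the data so the stored difference is immutable.
        diff = tuple((offset, bytes(data)) for offset, data in regions)
        with self._lock:
            # Metadata step: the position in the list is the sequential order.
            self._diffs.append(diff)
            return len(self._diffs)

    def snapshot(self, version):
        """Materialize the file as it looks after the first `version` writes."""
        if not 0 <= version <= len(self._diffs):
            raise ValueError("unknown version")
        buf = bytearray()
        for diff in self._diffs[:version]:
            for offset, data in diff:
                end = offset + len(data)
                if end > len(buf):
                    # Grow the file with zero bytes to cover the new region.
                    buf.extend(b"\x00" * (end - len(buf)))
                buf[offset:end] = data   # later versions win on overlaps
        return bytes(buf)


f = VersionedFile()
v1 = f.write([(0, b"aaaa"), (8, b"bb")])   # two non-contiguous regions
v2 = f.write([(2, b"XXXX")])               # overlaps the first write
older = f.snapshot(v1)                     # still readable after v2
newer = f.snapshot(v2)
```

Because earlier differences are never modified, a reader holding `v1` keeps seeing a consistent snapshot regardless of later writers, and all regions of one write become visible together. Only the version assignment is serialized here; the data itself is never locked.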
