From NoSQL Accumulo to NewSQL Graphulo: Design and utility of graph algorithms inside a BigTable database

Hutchison, Dylan; Kepner, Jeremy; Gadepally, Vijay; Howe, Bill

doi:10.1109/hpec.2016.7761577

Cited by 12 publications

(11 citation statements)

References 33 publications


“…To use Graphulo through D4M, the first step is to bind to a database, requesting a Graphulo object: Graphulo have been shown to scale well to multi-node Accumulo instances [10] and outperform the client-side alternative in many cases. Our extensive performance results indicate that D4M-Graphulo can be used in cases where data size makes operations impossible to complete client-side due to memory constraints [11] [12]. We have performed significant experiments with the D4M-Graphulo tool and have compared it to numerous parallel processing paradigms.…”

Section: Graphulo (mentioning)

confidence: 99%

D4M 3.0: Extended database and language capabilities

Milechin, Gadepally, Samsi, et al. 2017

2017 IEEE High Performance Extreme Computing Conference (HPEC)

Self Cite


Abstract: The D4M tool was developed to address many of today's data needs. This tool is used by hundreds of researchers to perform complex analytics on unstructured data. Over the past few years, the D4M toolbox has evolved to support connectivity with a variety of new database engines, including SciDB. D4M-Graphulo provides the ability to do graph analytics in the Apache Accumulo database. Finally, an implementation using the Julia programming language is also now available. In this article, we describe some of our latest additions to the D4M toolbox and our upcoming D4M 3.0 release. We show through benchmarking and scaling results that we can achieve fast SciDB ingest using the D4M-SciDB connector, that using Graphulo can enable graph algorithms on scales that can be memory limited, and that the Julia implementation of D4M achieves comparable performance or exceeds that of the existing MATLAB® implementation.


“…During scans, the user can execute arbitrary code in the form of iterators that run server-side as data streams from each partition in parallel. Iterator code can even initiate scans on or write entries to additional tables, a fact we previously exploited in the Graphulo matrix math library [20,21].…”

Section: Accumulo Implementation of LaraDB (mentioning)

confidence: 99%

LaraDB

Hutchison, Howe, Suciu 2017

Proceedings of the 4th ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond

Self Cite


Analytics tasks manipulate structured data with variants of relational algebra (RA) and quantitative data with variants of linear algebra (LA). The two computational models have overlapping expressiveness, motivating a common programming model that affords unified reasoning and algorithm design. At the logical level we propose Lara, a lean algebra of three operators, that expresses RA and LA as well as relevant optimization rules. We show a series of proofs that position Lara at just the right level of expressiveness for a middleware algebra: more explicit than MapReduce but more general than RA or LA. At the physical level we find that the Lara operators afford efficient implementations using a single primitive that is available in a variety of backend engines: range scans over partitioned sorted maps.

To evaluate these ideas, we implemented the Lara operators as range iterators in Apache Accumulo, a popular implementation of Google's BigTable. First we show how Lara expresses a sensor quality control task, and we measure the performance impact of optimizations Lara admits on this task. Second we show that the LaraDB implementation outperforms Accumulo's native MapReduce integration on a core task involving join and aggregation in the form of matrix multiply, especially at smaller scales that are typically a poor fit for scale-out approaches. We find that LaraDB offers a conceptually lean framework for optimizing mixed-abstraction analytics tasks, without giving up fast record-level updates and scans.


“…Solutions for big data problems generally involve distributed computation and need to take the full advantage of data locality. Therefore, instead of using an external system, performing computations inside a database system is a preferable solution [7]. One approach to perform big data computations inside a database system is using NewSQL databases [8].…”

Section: Introduction (mentioning)

confidence: 99%

“…These type of databases seek solutions to provide scalability of NoSQL systems while retaining the SQL guarantees (ACID properties) of relational databases. However, even though using a NewSQL database can be a good alternative, some researchers take a different approach and seek solutions based on performing big data computations inside NoSQL databases [7]. To that extent, the Graphulo library [9] realizing the kernel operations of Graph Basic Linear Algebra Subprogram (GraphBLAS) [10] in Accumulo NoSQL database is recently developed.…”

Section: Introduction (mentioning)

confidence: 99%

Scaling sparse matrix-matrix multiplication in the Accumulo database

Demirci, Aykanat 2019

Distrib Parallel Databases


We propose and implement a sparse matrix-matrix multiplication (SpGEMM) algorithm running on top of Accumulo's iterator framework which enables high performance distributed parallelism. The proposed algorithm provides write-locality while ingesting the output matrix back to database via utilizing row-by-row parallel SpGEMM. The proposed solution also alleviates scanning of input matrices multiple times by making use of Accumulo's batch scanning capability which is used for accessing multiple ranges of key-value pairs in parallel. Even though the use of batch-scanning introduces some latency overheads, these overheads are alleviated by the proposed solution and by using node-level parallelism structures. We also propose a matrix partitioning scheme which reduces the total communication volume and provides a balance of workload among servers. The results of extensive experiments performed on both real-world and synthetic sparse matrices show that the proposed algorithm scales significantly better than the outer-product parallel SpGEMM algorithm available in the Graphulo library. By applying the proposed matrix partitioning, the performance of the proposed algorithm is further improved considerably.

Keywords: Databases • NoSQL • Accumulo • Graphulo • Parallel and distributed computing • Sparse matrices • Sparse matrix-matrix multiplication • SpGEMM • Matrix partitioning • Graph partitioning • Data locality

This work is partially supported by the Scientific and Technological Research Council of Turkey (TUBITAK) under project EEEAG-115E512.

