Loading databases using dataflow parallelism

Barclay, Thomas; Barnes, Robert; Gray, Jim; Sundaresan, Prakash

doi:10.1145/190627.190647

Cited by 26 publications

(10 citation statements)

References 3 publications

“…Sequential loads can take a very long time, e.g., loading a terabyte of data can take weeks and months! Hence, pipelined and partitioned parallelism are typically exploited [6]. Doing a full load has the advantage that it can be treated as a long batch transaction that builds up a new database.…”

Section: Load (mentioning)

confidence: 99%

An overview of data warehousing and OLAP technology

1997

Data warehousing and on-line analytical processing (OLAP) are essential elements of decision support, which has increasingly become a focus of the database industry. Many commercial products and services are now available, and all of the principal database management system vendors now have offerings in these areas. Decision support places some rather different requirements on database technology compared to traditional on-line transaction processing applications. This paper provides an overview of data warehousing and OLAP technologies, with an emphasis on their new requirements. We describe back end tools for extracting, cleaning and loading data into a data warehouse; multidimensional data models typical of OLAP; front end client tools for querying and data analysis; server extensions for efficient query processing; and tools for metadata management and for managing the warehouse. In addition to surveying the state of the art, this paper also identifies some promising research issues, some of which are related to problems that the database research community has worked on for years, but others are only just beginning to be addressed. This overview is based on a tutorial that the authors presented at

“…Parallel databases. Dryad is heavily indebted to the traditional parallel database field [18]: e.g., Vulcan [22], Gamma [17], RDb [11], DB2 parallel edition [12], and many others. Many techniques for exploiting parallelism, including data partitioning; pipelined and partitioned parallelism; and hash-based distribution are directly derived from this work.…”

Section: Dataflow (mentioning)

confidence: 99%

Dryad

Isard, Budiu, et al. 2007

SIGOPS Oper. Syst. Rev.

515

Dryad is a general-purpose distributed execution engine for coarse-grain data-parallel applications. A Dryad application combines computational "vertices" with communication "channels" to form a dataflow graph. Dryad runs the application by executing the vertices of this graph on a set of available computers, communicating as appropriate through files, TCP pipes, and shared-memory FIFOs. The vertices provided by the application developer are quite simple and are usually written as sequential programs with no thread creation or locking. Concurrency arises from Dryad scheduling vertices to run simultaneously on multiple computers, or on multiple CPU cores within a computer. The application can discover the size and placement of data at run time, and modify the graph as the computation progresses to make efficient use of the available resources. Dryad is designed to scale from powerful multi-core single computers, through small clusters of computers, to data centers with thousands of computers. The Dryad execution engine handles all the difficult problems of creating a large distributed, concurrent application: scheduling the use of computers and their CPUs, recovering from communication or computer failures, and transporting data between vertices.

“…In database systems, parallel sorting is most heavily used in data loading and index creation [Barclay et al 1994]. One key issue is data skew and load balancing, especially when using range partitioning [Iyer and Dias 1990; Manku et al 1998].…”

Section: Parallelism and Threading (mentioning)

confidence: 99%

Implementing sorting in database systems

Graefe

2006

ACM Comput. Surv.

107

Most commercial database systems do (or should) exploit many sorting techniques that are publicly known, but not readily available in the research literature. These techniques improve both sort performance on modern computer systems and the ability to adapt gracefully to resource fluctuations in multiuser operations. This survey collects many of these techniques for easy reference by students, researchers, and product developers. It covers in-memory sorting, disk-based external sorting, and considerations that apply specifically to sorting in database systems.
