Red-blue pebbling revisited

Kwasniewski, Grzegorz; Kabic, Marko; Besta, Maciej; VandeVondele, Joost; Solcà, Raffaele; Hoefler, Torsten

doi:10.1145/3295500.3356181

Cited by 58 publications

(11 citation statements)

References 38 publications


“…Communication-efficient algorithms have been developed for many concrete computational problems and models, including for example matrix multiplication [17,18], FFT [17,1], sorting [1], directed shortest paths [23], topological sorting [2], matrix transposition [1], the N-Body problem [14], QR- and LU-factorization [13], prime tables [5], and Cholesky decomposition [4]. In the blocked-I/O-model [1], using time-forward processing one can compute functions that have a given computation DAG G = (V, E) with O(Sort(|E|)) I/Os [10].…”

Section: Communication-efficient Algorithms (mentioning)

confidence: 99%


The Red-Blue Pebble Game on Trees and DAGs with Large Input

Gleinig

Hoefler

2022

Structural Information and Communication Complexity

Self Cite


Data movements between different levels of a memory hierarchy (I/Os) are a principal performance bottleneck. This is particularly noticeable in computations that have low complexity but large amounts of input data, often occurring in "big data". Using the red-blue pebble game, we investigate the I/O-complexity of directed acyclic graphs (DAGs) with a large proportion of input vertices. For trees, we show that the number of leaves is a 2-approximation for the optimal number of I/Os. Similar techniques as we use in the proof of the results for trees allow us to find lower and upper bounds of the optimal number of I/Os for general DAGs. The larger the proportion of input vertices, the stronger those bounds become. For families of DAGs with bounded degree and a large proportion of input vertices (meaning that there exists some constant c > 0 such that for every DAG G of this family, the proportion p of input vertices satisfies p > c) our bounds give constant factor approximations, improving the previous logarithmic approximation factors. For those DAGs, by avoiding certain I/O-inefficiencies, which we will define precisely, a pebbling strategy is guaranteed to satisfy those bounds and asymptotics. We extend the I/O-bounds for trees to a multiprocessor setting with fast individual memories and a slow shared memory.


“…The red-blue pebble game allows to analyze and optimize the I/Os of general computations. For example, it has been used to optimize the I/Os of classical matrix multiplication [17,18], which can be considered very opposite to the computations of this paper as it allows extensive data reuse.…”

Section: Communication-efficient Algorithms (mentioning)

confidence: 99%

The Red-Blue Pebble Game on Trees and DAGs with Large Input

Gleinig

Hoefler

2022

Structural Information and Communication Complexity

Self Cite


“…Computing an I/O complexity upper bound for an algorithm is the most reasonable way to assess the tightness of a lower bound. While this computation is usually done by hand using ad hoc techniques specific to each studied algorithm [1,12,23,28,31,36], Fauzia et al [15] proposed a heuristic that directly reasons on the CDAG, which unfortunately does not scale to real programs. Finding an upper bound for a fixed architecture can also be viewed as finding an optimized program transformation that minimizes data movement costs, which also implies being able to evaluate this cost.…”

Section: Related Work (mentioning)

confidence: 99%

IOOpt: automatic derivation of I/O complexity bounds for affine programs

Olivry

Iooss

Tollenaere

et al. 2021

Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation


This work was supported in part by the U.S. National Science Foundation through award 2018016, by MIAI Grenoble Alpes (ANR-19-P3IA-0003), and by the Bpifrance Programme d'Investissements d'Avenir (PIA) as part of the ES3CAP project.

… contraction and convolution kernels. Then we evaluate numerically the tightness of our bound using the convolution layers of Yolo9000 and representative tensor contractions from the TCCG benchmark suite. Finally, we show the pertinence of our I/O complexity model by reporting the running time of the recommended tiled code for the convolution layers of Yolo9000.


“…This fact indicates that, sequentially executing the dataflow and assigning most of the effective on-chip memory to the outputs can reach the minimum off-chip memory access. Otherwise, if we perform the dataflow in parallel, the equation (21) means that fully utilizing the on-chip memory owned by each processor to produce the partial sum could maximize the output data reuse and reduce the data transmission in the memory hierarchy.…”

Section: Dataflow (mentioning)

confidence: 99%

“…After Hong & Kung established the I/O complexity theory [17], Savage developed the notion of S-span to derive Hong-Kung style lower bounds [23]. Kwasniewski et al provided a new proof of I/O complexity of matrix-matrix multiplication and designed a parallel algorithm to reach its lower bound [21]. Although the red-blue pebble game model has been proposed for many years [1-3, 12, 24, 28], it is still difficult to use this model to establish I/O lower bounds of composite algorithms which involve several different kinds of computational patterns [13].…”

Section: Related Work (mentioning)

confidence: 99%

Communication Lower Bounds of Convolutions in CNNs

Zhang

Xiao

Tan

2020

Proceedings of the 32nd ACM Symposium on Parallelism in Algorithms and Architectures


Convolution is the most time-consuming part in the computation of convolutional neural networks (CNNs), which have achieved great successes in numerous practical applications. Due to the complex data dependency and the increase in the amount of model samples, the convolution suffers from high overhead on data movement (i.e., memory access). This work provides comprehensive analysis and methodologies to minimize the communication for the convolution in CNNs. With an in-depth analysis of the recent I/O complexity theory under the red-blue game model, we develop a general I/O lower bound theory for a composite algorithm which consists of several different sub-computations. Based on the proposed theory, we establish the data movement lower bound results for two main convolution algorithms in CNNs, namely the direct convolution and Winograd algorithm, which represents the direct and indirect implementations of a convolution respectively. Next, derived from I/O lower bound results, we design the near I/O-optimal dataflow strategies for the two main convolution algorithms by fully exploiting the data reuse. Furthermore, in order to push the envelope of performance of the near I/O-optimal dataflow strategies further, an aggressive design of auto-tuning based on I/O lower bounds, is proposed to search an optimal parameter configuration for the direct convolution and Winograd algorithm on GPU, such as the number of threads and the size of shared memory used in each thread block. Finally, experiment evaluation results on the direct convolution and Winograd algorithm show that our dataflow strategies with the auto-tuning approach can achieve about 3.32× performance speedup on average over cuDNN. 
In addition, compared with TVM, which represents the state-of-the-art technique for auto-tuning, not only our auto-tuning method based on I/O lower bounds can find the optimal parameter configuration faster, but also our solution has higher performance than the optimal solution provided by TVM.
