The Cedar System And An Initial Performance Study

Kuck, David J.; Davidson, Edward S.; Lawrie, Duncan H.; Sameh, Ahmed H.; Zhu, Changfeng; Veidenbaum, Alexander V.; Konicek, J.; Yew, Pen-Chung; Gallivan, Kyle A.; Jalby, William; Wijshoff, Harry A. G.; Bramley, Randall; Yang, Ulrike Meier; Emrath, Perry A.; Padua, David; Eigenmann, Rudolf; Hoeflinger, Jay; Jaxon, G.; Li, Z.; Murphy, Thérèse; Andrews, John B.; Turner, Stephen W.

doi:10.1109/isca.1993.698562

Cited by 26 publications

(16 citation statements)

References 7 publications

“…The Illinois studies are traditional [8,15]; they extend Kap, an automatic parallelizer, and then use it to parallelize the Perfect Benchmarks, dusty deck programs. Their target architecture is Cedar, a shared-memory parallel machine with cluster memory and vector processors.…”

Section: Related Work (mentioning)

confidence: 99%

“…We believe that just as vectorization was not successful for dusty deck programs, that when programmers have never considered medium to large grain parallelism, automatic parallelization is doomed to failure. Indeed, finding medium to large grain parallelism is more difficult than single statement parallelism and compilers have had few successes on dusty deck programs [8,15,20,21].…”

Section: Introduction (mentioning)

confidence: 99%

Evaluating automatic parallelization for efficient execution on shared-memory multiprocessors

McKinley

1994

Proceedings of the 8th International Conference on Supercomputing - ICS '94

We present a parallel code generation algorithm for complete applications and a new experimental methodology that tests the efficacy of our approach. The algorithm optimizes for data locality and parallelism, reducing or eliminating false sharing. It also uses interprocedural analysis and transformations to improve the granularity of parallelism. Although the individual components of the algorithm have been published previously, their coordination is unique to this paper. For experimental validation, we do not attempt to parallelize 'dusty deck' programs where many have tried and failed. Instead, we collect programs where the users tried to achieve excellent parallel performance. We apply our optimizations to sequential versions of these programs, i.e., the compiler was required to use its analysis and algorithms to parallelize the program and could not rely on user assertions that, for example, a loop is parallel. With this metric, our algorithm improves or matches hand-coded parallel programs on shared-memory, bus-based parallel machines for eight of the nine programs in our test suite.

“…However, the experience with parallel applications has shown that reorganizing a parallel program to exploit just two levels of architectural hierarchy is a nontrivial problem (see, for example, experiences from the Cedar project [10,11,18]). Software technology probably will limit the level of system hierarchy to a very small number, most likely at two levels, in the foreseeable future.…”

Section: Introduction (mentioning)

confidence: 99%

Performance Evaluation of Wire-Limited Hierarchical Networks

Hsu

Yew

1997

Journal of Parallel and Distributed Computing

“…The most related architectural work that we are aware of is the work of Larus et al [20], Zhang et al [28], and the work on advanced synchronization mechanisms [3,9,10,16,17,18,23,24,25,29].…”

Section: Related Work (mentioning)

confidence: 99%

“…Such work includes the Full/Empty bit of the HEP multiprocessor [25], the atomic Fetch&Add primitive of the NYU Ultracomputer [10], the Fetch&Op synchronization primitives of the IBM RP3 [3,23], support for combining trees [16,24], the memory-based synchronization primitives in Cedar [17,18,29], and the set of synchronization primitives proposed by Goodman et al [9].…”

Section: Related Work (mentioning)

confidence: 99%

Architectural support for parallel reductions in scalable shared-memory multiprocessors

Garzarán

Prvulovic

Zhang

et al.

Proceedings 2001 International Conference on Parallel Architectures and Compilation Techniques

Reductions are important and time-consuming operations in many scientific codes. Effective parallelization of reductions is a critical transformation for loop parallelization, especially for sparse, dynamic applications. Unfortunately, conventional reduction parallelization algorithms are not scalable. In this paper, we present new architectural support that significantly speeds up parallel reduction and makes it scalable in shared-memory multiprocessors. The required architectural changes are mostly confined to the directory controllers. Experimental results based on simulations show that the proposed support is very effective. While conventional software-only reduction parallelization delivers average speedups of only 2.7 for 16 processors, our scheme delivers average speedups of 7.6.
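
For readers unfamiliar with the term, the following is a minimal sketch of one common software-only approach to reduction parallelization: each worker accumulates into a private copy of the reduction array, and the private copies are merged at the end. It is an illustration under stated assumptions, not the scheme from the paper above; the function names, the chunking strategy, and the use of Python threads are all hypothetical choices made for this example. Python threads will not yield real speedup here because of the interpreter lock; the point is only the structure of the transformation.

```python
from concurrent.futures import ThreadPoolExecutor


def sequential_reduction(indices, values, size):
    """Reference loop: w[idx] += val, a reduction with irregular subscripts."""
    w = [0.0] * size
    for idx, val in zip(indices, values):
        w[idx] += val
    return w


def privatized_reduction(indices, values, size, workers=4):
    """Hypothetical sketch: private accumulation followed by a merge phase."""
    pairs = list(zip(indices, values))
    chunk = max(1, -(-len(pairs) // workers))  # ceiling division
    chunks = [pairs[i:i + chunk] for i in range(0, len(pairs), chunk)]

    def accumulate(part):
        # Each worker updates only its own private array, so no locking is needed.
        private = [0.0] * size
        for idx, val in part:
            private[idx] += val
        return private

    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(accumulate, chunks))

    # Merge phase: work grows with size * number of private copies.
    w = [0.0] * size
    for private in partials:
        for i, v in enumerate(private):
            w[i] += v
    return w


if __name__ == "__main__":
    idx = [0, 2, 2, 1, 0, 3, 2]
    val = [1, 2, 3, 4, 5, 6, 7]
    print(privatized_reduction(idx, val, size=4, workers=3))
```

In this sketch the merge loop touches every element of every private copy, so its cost grows with both the array size and the worker count. That is one plausible reading of why such software-only schemes are described as not scalable, though the abstract itself does not spell out the cause.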
