In the following sections, we summarize the contributions made through support from this DOE ECPI award to research and training in advanced computing systems.

1 Dynamic scheduling of layered parallelism on emerging multi-core processors and many-core clusters

We have developed several schedulers for dynamic multi-grain parallelization on the Cell Broadband Engine. The Cell processor presents a new paradigm for parallel computing on multi-core platforms: it combines conventional processor cores with customized accelerators and offers programmers an explicitly managed memory hierarchy for tighter control of locality and performance. Parallel computation on the Cell is accomplished by off-loading compute-intensive and data-intensive code from the conventional cores to the vector SIMD accelerators. Heterogeneous multi-core architectures such as the Cell represent a design point in computer architecture that holds greater promise for sustaining high performance and power efficiency than conventional, homogeneous multi-core architectures. Cell is also the processor of choice for Roadrunner, a Petaflop-capable supercomputer currently under development by IBM. For these reasons, we believe that the research conducted on Cell with support from the DOE ECPI award is timely, relevant, and in line with DOE missions.

The first of the novel schedulers developed in this activity, named MGPS-SLED (for Multi-grain Parallelism Scheduling using Slack Minimizing Event-Driven execution), effectively exploits thread-level and data-level (SIMD) parallelism at runtime, without prior knowledge of the application or input from the programmer. MGPS-SLED follows an event-driven execution model for scheduling tasks and data parallelism of varying granularity on the synergistic processing elements (SPEs) of the Cell.
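The event-driven, multi-grain idea described above can be illustrated with a minimal sketch. This is not the MGPS-SLED implementation: the names `Plan`, `choose_plan`, and `run_events`, the event labels, and the simple rule (give idle SPEs to loop-level parallelism only when there are fewer off-loaded tasks than SPEs) are all hypothetical and chosen only for illustration. The sketch also assumes a single Cell processor with 8 SPEs.

```python
from dataclasses import dataclass

NUM_SPES = 8  # assumption: one Cell processor with 8 SPEs


@dataclass
class Plan:
    """Hypothetical scheduling decision for the current program phase."""
    task_parallel: int  # number of tasks running concurrently on SPEs
    loop_degree: int    # number of SPEs cooperating on each task's loops


def choose_plan(active_tasks: int, num_spes: int = NUM_SPES) -> Plan:
    """Pick a granularity mix from the number of off-loaded tasks.

    Illustrative rule only: when tasks alone can fill the SPEs, use
    task-level parallelism; otherwise split each task's loops across
    the SPEs that would otherwise sit idle.
    """
    if active_tasks <= 0:
        return Plan(task_parallel=0, loop_degree=0)
    if active_tasks >= num_spes:
        return Plan(task_parallel=num_spes, loop_degree=1)
    return Plan(task_parallel=active_tasks,
                loop_degree=num_spes // active_tasks)


def run_events(events, num_spes: int = NUM_SPES):
    """Replay 'offload' and 'complete' events; re-plan after each one."""
    active = 0
    plans = []
    for event in events:
        if event == "offload":
            active += 1
        elif event == "complete":
            active = max(0, active - 1)
        else:
            raise ValueError(f"unknown event: {event!r}")
        plans.append(choose_plan(active, num_spes))
    return plans
```

For example, `run_events(["offload", "offload", "complete"])` re-evaluates the plan three times, once per event. The decision is driven entirely by events observed at runtime, which mirrors the claim that no prior knowledge of the application is required.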
MGPS-SLED provides a novel mechanism for deciding between task-level, loop-level, and data-level parallelization on the fly, based on runtime workload characterization and observable utilization metrics on the SPEs. As part of the MGPS-SLED effort, we have ported the MELISSES hardware monitor to the Cell PPE and SPEs (the conventional power processing element and the synergistic processing elements of the processor, respectively) to collect continuous data on SPE and PPE utilization and to drive the multi-grain decomposition and scheduling processes. More specifically, MELISSES enabled us to collect a historical profile of task execution on the SPEs, which, in conjunction with program phase analysis, enabled MGPS-SLED to adaptively select the layers and degrees of parallelism to activate in any phase of the program. The major contribution of MGPS-SLED, namely phase-aware optimization of the scheduling process, would have been impossible without the MELISSES performance monitoring framework. Phase-aware program control in MELISSES has enabled unprecedented performance and power optimizations in parallel programs. We view this result as one of the major contributions of this effort. MGPS-SLED...