Proceedings of the 2018 International Conference on Supercomputing
DOI: 10.1145/3205289.3205310

Reducing Data Movement on Large Shared Memory Systems by Exploiting Computation Dependencies

Abstract: Shared memory systems are becoming increasingly complex as they typically integrate several storage devices. That brings different access latencies or bandwidth rates depending on the proximity between the cores where memory accesses are issued and the storage devices containing the requested data. In this context, techniques to manage and mitigate non-uniform memory access (NUMA) effects consist of migrating threads, memory pages or both, and are generally applied by the system software. We propose techniques …
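The mitigation the abstract describes, moving memory pages closer to the threads that access them, is typically implemented on Linux through the move_pages(2) interface from libnuma. Below is a minimal, hedged sketch of that mechanism, not the paper's own technique: it allocates one page and asks the kernel to place it on NUMA node 0, an arbitrary choice; it requires a NUMA-capable Linux system, the libnuma headers, and linking with -lnuma.

```cpp
// Sketch of kernel-assisted page migration, the mechanism NUMA
// mitigation techniques build on. Node 0 is an arbitrary target;
// error handling is minimal. Build: g++ demo.cpp -lnuma
#include <numaif.h>   // move_pages(2) wrapper
#include <unistd.h>   // sysconf
#include <cstdlib>
#include <cstdio>

int main() {
    long page_size = sysconf(_SC_PAGESIZE);
    void* buf = aligned_alloc(page_size, page_size);  // one page
    ((char*)buf)[0] = 1;               // touch so the page is actually allocated

    void* pages[1]  = { buf };
    int   nodes[1]  = { 0 };           // request placement on NUMA node 0
    int   status[1] = { -1 };          // kernel reports resulting node here
    if (move_pages(0 /* this process */, 1, pages, nodes, status, 0) == 0)
        std::printf("page now on node %d\n", status[0]);
    else
        std::perror("move_pages");
    std::free(buf);
    return 0;
}
```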

Cited by 23 publications (14 citation statements) | References 32 publications
“…Many works exploit the characteristics of task dataflow programming models to perform optimizations [24], [93]. The runtime system can transparently manage GPUs [7], [76], FPGA accelerators [18], [85], multi-node clusters [20], [27], [28], heterogeneous memories [4], [63], scratchpad memories [5], NUMA [81], [82] and cache coherent NUMA [21], [23] systems. Adding hardware support, the runtime system can guide cache replacement [37], [65], cache coherence deactivation [22], cache prefetching [47], [75], cache communication mechanisms in producer-consumer task relationships [64], [66], reliability and resilience [51]-[53], value approximation [19], and DVFS to accelerate critical tasks [26].…”
Section: Task Dataflow Programming Models (mentioning)
confidence: 99%
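To illustrate how a task dataflow model exposes the dependency information these runtimes exploit, here is a minimal sketch using OpenMP task depend clauses, one representative of this family; the computation itself is a placeholder. The in/out annotations give the runtime the producer-consumer edges it can use for decisions such as NUMA-aware task placement.

```cpp
// Minimal task dataflow sketch with OpenMP (compile with -fopenmp).
// The depend clauses declare which data each task produces or
// consumes, so the runtime sees the dependency graph explicitly.
#include <cstdio>

int main() {
    double a = 0.0, b = 0.0, c = 0.0;
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task depend(out: a)   // producer of a
        a = 1.0;

        #pragma omp task depend(out: b)   // producer of b
        b = 2.0;

        // Consumer: the runtime knows this task must run after both
        // producers, and where a and b were written.
        #pragma omp task depend(in: a, b) depend(out: c)
        c = a + b;

        #pragma omp taskwait              // wait for the whole graph
    }
    std::printf("c = %f\n", c);
    return 0;
}
```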
“…Online methods [4,9,38,45] profile and make optimization decisions as the application runs. However, this necessitates very low-overhead profiling, so that the profiling cost does not outweigh the optimization gains.…”
Section: Related Work (mentioning)
confidence: 99%
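A hedged sketch of that low-overhead constraint: rather than instrumenting every event, an online method can sample an application-maintained counter from a background thread at a fixed period, so profiling cost stays bounded regardless of the event rate. The counter, period, and workload here are illustrative, not taken from the cited works.

```cpp
// Periodic sampling instead of per-event instrumentation: the
// profiler thread wakes at a fixed interval and reads a counter
// the hot path updates with a cheap relaxed atomic increment.
#include <atomic>
#include <chrono>
#include <thread>
#include <cstdio>

std::atomic<long> accesses{0};   // updated on the hot path

void worker() {
    for (int i = 0; i < 10000000; ++i)
        accesses.fetch_add(1, std::memory_order_relaxed);
}

int main() {
    std::thread t(worker);
    long last = 0;
    for (int s = 0; s < 5; ++s) {   // 5 samples, 10 ms apart
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
        long now = accesses.load(std::memory_order_relaxed);
        std::printf("sample %d: +%ld accesses\n", s, now - last);
        last = now;   // an online optimizer would act on this rate
    }
    t.join();
    return 0;
}
```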
“…Reducing data movement (data transfer between processors and system memory) in such systems will improve overall performance. A number of techniques (and associated studies) have focused on this goal; these include load balancing [16]-[18], graph partitioning [19], [20] and spatial partitioning (or spatial messaging) [1], [21]. Generally, graph partitioning algorithms divide work evenly among computation nodes while minimizing data movement.…”
Section: B. Reduction of Memory Movement (mentioning)
confidence: 99%
“…Generally, graph partitioning algorithms divide work evenly among computation nodes while minimizing data movement. To improve performance and reduce data transfer across the system, Barrera et al. [19] used the graph partitioning technique: they automatically collected task dependency graph information at runtime, then applied advanced graph partitioning to break the graphs into smaller parts.…”
Section: B. Reduction of Memory Movement (mentioning)
confidence: 99%
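The description above suggests the following shape for the approach: build a task dependency graph whose edges carry the bytes each task consumes from its predecessors, then split the tasks across NUMA nodes so that heavy edges stay inside one node. The greedy, capacity-bounded heuristic below is only a hedged stand-in for the advanced partitioner Barrera et al. actually use; the function names and the toy diamond graph are illustrative.

```cpp
// Toy task-graph partitioner: place each task on the NUMA node it
// pulls the most bytes from, subject to an even-load cap. A real
// system would use a proper partitioner (e.g. METIS-style).
#include <vector>
#include <cstdio>

struct Edge { int src, dst; long bytes; };   // bytes moved src -> dst

std::vector<int> partition_tasks(int num_tasks,
                                 const std::vector<Edge>& edges,
                                 int num_nodes) {
    const int cap = (num_tasks + num_nodes - 1) / num_nodes;  // balance cap
    std::vector<int> part(num_tasks, -1);
    std::vector<int> load(num_nodes, 0);
    for (int t = 0; t < num_tasks; ++t) {      // assumes topological task order
        long best_aff = -1;
        int best = -1;
        for (int n = 0; n < num_nodes; ++n) {
            if (load[n] >= cap) continue;      // node already full
            long aff = 0;                      // bytes arriving from node n
            for (const Edge& e : edges)
                if (e.dst == t && part[e.src] == n) aff += e.bytes;
            if (aff > best_aff || (aff == best_aff && load[n] < load[best])) {
                best_aff = aff;
                best = n;
            }
        }
        part[t] = best;
        ++load[best];
    }
    return part;
}

int main() {
    // Diamond-shaped dependency graph: 0 -> {1,2} -> 3.
    std::vector<Edge> edges = {{0,1,64}, {0,2,64}, {1,3,128}, {2,3,128}};
    std::vector<int> part = partition_tasks(4, edges, 2);
    for (int t = 0; t < 4; ++t)
        std::printf("task %d -> NUMA node %d\n", t, part[t]);
    return 0;
}
```

With two nodes this places tasks 0 and 1 on node 0 and tasks 2 and 3 on node 1, keeping the heaviest producer-consumer edge (2 to 3) local while respecting the balance constraint.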