Proceedings of the 2018 International Conference on Supercomputing
DOI: 10.1145/3205289.3205310

Reducing Data Movement on Large Shared Memory Systems by Exploiting Computation Dependencies

Abstract: Shared memory systems are becoming increasingly complex as they typically integrate several storage devices. That brings different access latencies or bandwidth rates depending on the proximity between the cores where memory accesses are issued and the storage devices containing the requested data. In this context, techniques to manage and mitigate non-uniform memory access (NUMA) effects consist of migrating threads, memory pages or both, and are generally applied by the system software. We propose techniques …
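The mitigation the abstract describes, moving memory pages closer to the threads that access them, is typically implemented on Linux through the move_pages(2) interface from libnuma. Below is a minimal, hedged sketch of that mechanism, not the paper's own technique: it allocates one page and asks the kernel to place it on NUMA node 0, an arbitrary choice; it requires a NUMA-capable Linux system, the libnuma headers, and linking with -lnuma.

```cpp
// Sketch of kernel-assisted page migration, the mechanism NUMA
// mitigation techniques build on. Node 0 is an arbitrary target;
// error handling is minimal. Build: g++ demo.cpp -lnuma
#include <numaif.h>   // move_pages(2) wrapper
#include <unistd.h>   // sysconf
#include <cstdlib>
#include <cstdio>

int main() {
    long page_size = sysconf(_SC_PAGESIZE);
    void* buf = aligned_alloc(page_size, page_size);  // one page
    ((char*)buf)[0] = 1;               // touch so the page is actually allocated

    void* pages[1]  = { buf };
    int   nodes[1]  = { 0 };           // request placement on NUMA node 0
    int   status[1] = { -1 };          // kernel reports resulting node here
    if (move_pages(0 /* this process */, 1, pages, nodes, status, 0) == 0)
        std::printf("page now on node %d\n", status[0]);
    else
        std::perror("move_pages");
    std::free(buf);
    return 0;
}
```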

Cited by 23 publications (14 citation statements) | References 32 publications
“…Many works exploit the characteristics of task dataflow programming models to perform optimizations [24], [93]. The runtime system can transparently manage GPUs [7], [76], FPGA accelerators [18], [85], multi-node clusters [20], [27], [28], heterogeneous memories [4], [63], scratchpad memories [5], NUMA [81], [82] and cache coherent NUMA [21], [23] systems. Adding hardware support, the runtime system can guide cache replacement [37], [65], cache coherence deactivation [22], cache prefetching [47], [75], cache communication mechanisms in producer-consumer task relationships [64], [66], reliability and resilience [51]-[53], value approximation [19], and DVFS to accelerate critical tasks [26].…”
Section: Task Dataflow Programming Models (mentioning)
confidence: 99%
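To illustrate how a task dataflow model exposes the dependency information these runtimes exploit, here is a minimal sketch using OpenMP task depend clauses, one representative of this family; the computation itself is a placeholder. The in/out annotations give the runtime the producer-consumer edges it can use for decisions such as NUMA-aware task placement.

```cpp
// Minimal task dataflow sketch with OpenMP (compile with -fopenmp).
// The depend clauses declare which data each task produces or
// consumes, so the runtime sees the dependency graph explicitly.
#include <cstdio>

int main() {
    double a = 0.0, b = 0.0, c = 0.0;
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task depend(out: a)   // producer of a
        a = 1.0;

        #pragma omp task depend(out: b)   // producer of b
        b = 2.0;

        // Consumer: the runtime knows this task must run after both
        // producers, and where a and b were written.
        #pragma omp task depend(in: a, b) depend(out: c)
        c = a + b;

        #pragma omp taskwait              // wait for the whole graph
    }
    std::printf("c = %f\n", c);
    return 0;
}
```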
“…Online methods [4,9,38,45] profile and make optimization decisions as the application runs. However, this necessitates very low-overhead profiling, so that the profiling cost does not outweigh the optimization gains.…”
Section: Related Work (mentioning)
confidence: 99%
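A hedged sketch of that low-overhead constraint: rather than instrumenting every event, an online method can sample an application-maintained counter from a background thread at a fixed period, so profiling cost stays bounded regardless of the event rate. The counter, period, and workload here are illustrative, not taken from the cited works.

```cpp
// Periodic sampling instead of per-event instrumentation: the
// profiler thread wakes at a fixed interval and reads a counter
// the hot path updates with a cheap relaxed atomic increment.
#include <atomic>
#include <chrono>
#include <thread>
#include <cstdio>

std::atomic<long> accesses{0};   // updated on the hot path

void worker() {
    for (int i = 0; i < 10000000; ++i)
        accesses.fetch_add(1, std::memory_order_relaxed);
}

int main() {
    std::thread t(worker);
    long last = 0;
    for (int s = 0; s < 5; ++s) {   // 5 samples, 10 ms apart
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
        long now = accesses.load(std::memory_order_relaxed);
        std::printf("sample %d: +%ld accesses\n", s, now - last);
        last = now;   // an online optimizer would act on this rate
    }
    t.join();
    return 0;
}
```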
“…Reducing data movement (data transfer between processors and system memory) in such systems will improve overall performance. A number of techniques (and associated studies) have focused on this goal; these include load balancing [16]-[18], graph partitioning [19], [20] and spatial partitioning (or spatial messaging) [1], [21]. Generally, graph partitioning algorithms divide work evenly among computation nodes while minimizing data movement.…”
Section: B. Reduction of Memory Movement (mentioning)
confidence: 99%
“…Generally, graph partitioning algorithms divide work evenly among computation nodes while minimizing data movement. To improve performance and reduce data transfer across the system, Barrera et al. [19] used the graph partitioning technique: they automatically collected task dependency graph information at runtime, then applied advanced graph partitioning to break the graphs into smaller parts.…”
Section: B. Reduction of Memory Movement (mentioning)
confidence: 99%
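The description above suggests the following shape for the approach: build a task dependency graph whose edges carry the bytes each task consumes from its predecessors, then split the tasks across NUMA nodes so that heavy edges stay inside one node. The greedy, capacity-bounded heuristic below is only a hedged stand-in for the advanced partitioner Barrera et al. actually use; the function names and the toy diamond graph are illustrative.

```cpp
// Toy task-graph partitioner: place each task on the NUMA node it
// pulls the most bytes from, subject to an even-load cap. A real
// system would use a proper partitioner (e.g. METIS-style).
#include <vector>
#include <cstdio>

struct Edge { int src, dst; long bytes; };   // bytes moved src -> dst

std::vector<int> partition_tasks(int num_tasks,
                                 const std::vector<Edge>& edges,
                                 int num_nodes) {
    const int cap = (num_tasks + num_nodes - 1) / num_nodes;  // balance cap
    std::vector<int> part(num_tasks, -1);
    std::vector<int> load(num_nodes, 0);
    for (int t = 0; t < num_tasks; ++t) {      // assumes topological task order
        long best_aff = -1;
        int best = -1;
        for (int n = 0; n < num_nodes; ++n) {
            if (load[n] >= cap) continue;      // node already full
            long aff = 0;                      // bytes arriving from node n
            for (const Edge& e : edges)
                if (e.dst == t && part[e.src] == n) aff += e.bytes;
            if (aff > best_aff || (aff == best_aff && load[n] < load[best])) {
                best_aff = aff;
                best = n;
            }
        }
        part[t] = best;
        ++load[best];
    }
    return part;
}

int main() {
    // Diamond-shaped dependency graph: 0 -> {1,2} -> 3.
    std::vector<Edge> edges = {{0,1,64}, {0,2,64}, {1,3,128}, {2,3,128}};
    std::vector<int> part = partition_tasks(4, edges, 2);
    for (int t = 0; t < 4; ++t)
        std::printf("task %d -> NUMA node %d\n", t, part[t]);
    return 0;
}
```

With two nodes this places tasks 0 and 1 on node 0 and tasks 2 and 3 on node 1, keeping the heaviest producer-consumer edge (2 to 3) local while respecting the balance constraint.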