DAMOV: A New Methodology and Benchmark Suite for Evaluating Data Movement Bottlenecks

Oliveira, Geraldo F.; Gómez-Luna, Juan; Orosa, Lois; Ghose, Saugata; Vijaykumar, Nandita; Fernández, Iván; Sadrosadati, Mohammad; Mutlu, Onur

doi:10.1109/access.2021.3110993

Cited by 43 publications

(27 citation statements)

References 390 publications

(281 reference statements)


“…We focus on three characteristics of NDP architectures that are of particular importance in the synchronization context. First, NDP architectures typically do not have a shared level of cache memory [8, 19, 25, 38, 42-46, 49, 55, 67, 98, 110, 111, 113, 119, 155, 158], since the NDP-suited workloads usually do not benefit from deep cache hierarchies due to their poor locality [33,43,133,143]. Second, NDP architectures do not typically support conventional hardware cache coherence protocols [8,19,25,38,42-45,49,55,67,82,98,111,119,155,158], because they would add area and traffic overheads [46,143], and would incur high complexity and latency [4], limiting the benefits of NDP.…”

Section: Memory Arrays (mentioning)

confidence: 99%

“…Recent research demonstrates the benefits of NDP for parallel applications, e.g., for genome analysis [23,84], graph processing [8,9,20,21,112,155,158], databases [20,38], security [54], pointer-chasing workloads [25,60,67,99], and neural networks [19,45,82,98]. In general, these applications exhibit high parallelism, low operational intensity, and relatively low cache locality [15,16,33,50,133], which make them suitable for NDP.…”

Section: Introduction (mentioning)

confidence: 99%


SynCron: Efficient Synchronization Support for Near-Data-Processing Architectures

Giannoula, Vijaykumar, Παπαδοπούλου et al. 2021

Preprint

Self Cite


Near-Data-Processing (NDP) architectures present a promising way to alleviate data movement costs and can provide significant performance and energy benefits to parallel applications. Typically, NDP architectures support several NDP units, each including multiple simple cores placed close to memory. To fully leverage the benefits of NDP and achieve high performance for parallel workloads, efficient synchronization among the NDP cores of a system is necessary. However, supporting synchronization in many NDP systems is challenging because they lack shared caches and hardware cache coherence support, which are commonly used for synchronization in multicore systems, and communication across different NDP units can be expensive. This paper comprehensively examines the synchronization problem in NDP systems, and proposes SynCron, an end-to-end synchronization solution for NDP systems. SynCron adds low-cost hardware support near memory for synchronization acceleration, and avoids the need for hardware cache coherence support. SynCron has three components: 1) a specialized cache memory structure to avoid memory accesses for synchronization and minimize latency overheads, 2) a hierarchical message-passing communication protocol to minimize expensive communication across NDP units of the system, and 3) a hardware-only overflow management scheme to avoid performance degradation when hardware resources for synchronization tracking are exceeded. We evaluate SynCron using a variety of parallel workloads, covering various contention scenarios. SynCron improves performance by 1.27× on average (up to 1.78×) under high-contention scenarios, and by 1.35× on average (up to 2.29×) under low-contention real applications, compared to state-of-the-art approaches. SynCron reduces system energy consumption by 2.08× on average (up to 4.25×).


“…Since main memory is a growing system performance and energy bottleneck [12,39,58,100,103,107,111,134,149,153,155], a RowHammer mitigation mechanism should exhibit acceptable performance and energy overheads at low area cost when configured for more vulnerable DRAM chips.…”

Section: Scaling With Increasing Rowhammer Vulnerability (mentioning)

confidence: 99%

BlockHammer: Preventing RowHammer at Low Cost by Blacklisting Rapidly-Accessed DRAM Rows

Yağlıkçı, Patel, Kim et al. 2021

Preprint

Self Cite


Aggressive memory density scaling causes modern DRAM devices to suffer from RowHammer, a phenomenon where rapidly activating (i.e., hammering) a DRAM row can cause bit-flips in physically-nearby rows. Recent studies demonstrate that modern DDR4/LPDDR4 DRAM chips, including chips previously marketed as RowHammer-safe, are even more vulnerable to RowHammer than older DDR3 DRAM chips. Many works show that attackers can exploit RowHammer bit-flips to reliably mount system-level attacks to escalate privilege and leak private data. Therefore, it is critical to ensure RowHammer-safe operation on all DRAM-based systems as they become increasingly more vulnerable to RowHammer. Unfortunately, state-of-the-art RowHammer mitigation mechanisms face two major challenges. First, they incur increasingly higher performance and/or area overheads when applied to more vulnerable DRAM chips. Second, they require either closely-guarded proprietary information about the DRAM chips' physical circuit layouts or modifications to the DRAM chip design. In this paper, we show that it is possible to efficiently and scalably prevent RowHammer bit-flips without knowledge of or modification to DRAM internals. To this end, we introduce BlockHammer, a low-cost, effective, and easy-to-adopt RowHammer mitigation mechanism that prevents all RowHammer bit-flips while overcoming the two key challenges. BlockHammer selectively throttles memory accesses that could otherwise potentially cause RowHammer bit-flips. The key idea of BlockHammer is to (1) track row activation rates using area-efficient Bloom filters, and (2) use the tracking data to ensure that no row is ever activated rapidly enough to induce RowHammer bit-flips. By guaranteeing that no DRAM row ever experiences a RowHammer-unsafe activation rate, BlockHammer (1) makes it impossible for a RowHammer bit-flip to occur and (2) greatly reduces a RowHammer attack's impact on the performance of co-running benign applications.
Our evaluations across a comprehensive range of 280 workloads show that, compared to the best of six state-of-the-art RowHammer mitigation mechanisms (all of which require knowledge of or modification to DRAM internals), BlockHammer provides (1) competitive performance and energy when the system is not under a RowHammer attack and (2) significantly better performance and energy when the system is under a RowHammer attack.


“…The key question in this approach is which functions in an application should be offloaded for PNM acceleration. Several recent works tackle this question for various applications, e.g., mobile consumer workloads [7], GPGPU workloads [86,87], graph processing and in-memory database workloads [62,179], and a wide variety of workloads from many domains [16]. We will discuss function-level PNM acceleration of mobile consumer workloads in this section, focusing on our recent work on the topic [7].…”

Section: Function-level Pnm Acceleration Of Mobile Consumer Workloads (mentioning)

confidence: 99%

“…Across all of these systems, the data working set sizes of modern applications are rapidly growing, while the need for fast analysis of such data is increasing. Thus, main memory is becoming an increasingly significant bottleneck across a wide variety of computing systems and applications [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]. Alleviating the main memory bottleneck requires the memory capacity, energy, cost, and performance to all scale in an efficient manner across technology generations.…”

Section: Introduction (mentioning)

confidence: 99%

A Modern Primer on Processing in Memory

Mutlu, Ghose, Gómez-Luna et al. 2020

Preprint

Self Cite


Modern computing systems are overwhelmingly designed to move data to computation. This design choice goes directly against at least three key trends in computing that cause performance, scalability and energy bottlenecks: (1) data access is a key bottleneck as many important applications are increasingly data-intensive, and memory bandwidth and energy do not scale well, (2) energy consumption is a key limiter in almost all computing platforms, especially server and mobile systems, (3) data movement, especially off-chip to on-chip, is very expensive in terms of bandwidth, energy and latency, much more so than computation. These trends are especially severely felt in the data-intensive server and energy-constrained mobile systems of today. At the same time, conventional memory technology is facing many technology scaling challenges in terms of reliability, energy, and performance. As a result, memory system architects are open to organizing memory in different ways and making it more intelligent, at the expense of higher cost. The emergence of 3D-stacked memory plus logic, the adoption of error correcting codes inside the latest DRAM chips, the proliferation of different main memory standards and chips specialized for different purposes (e.g., graphics, low-power, high bandwidth, low latency), and the necessity of designing new solutions to serious reliability and security issues, such as the RowHammer phenomenon, are evidence of this trend. This chapter discusses recent research that aims to practically enable computation close to data, an approach we call processing-in-memory (PIM). PIM places computation mechanisms in or near where the data is stored (i.e., inside the memory chips, in the logic layer of 3D-stacked memory, or in the memory controllers), so that data movement between the computation units and memory is reduced or eliminated.
While the general idea of PIM is not new, we discuss motivating trends in applications as well as memory circuits/technology that greatly exacerbate the need for enabling it in modern computing systems. We examine at least two promising new approaches to designing PIM systems to accelerate important data-intensive applications: (1) processing using memory by exploiting analog operational properties of DRAM chips to perform massively-parallel operations in memory, with low-cost changes, (2) processing near memory by exploiting 3D-stacked memory technology design to provide high memory bandwidth and low memory latency to in-memory logic. In both approaches, we describe and tackle relevant cross-layer research, design, and adoption challenges in devices, architecture, systems, and programming models. Our focus is on the development of in-memory processing designs that can be adopted in real computing platforms at low cost. We conclude by discussing work on solving key challenges to the practical adoption of PIM.
