Proceedings of the 2015 International Symposium on Memory Systems (MEMSYS 2015)
DOI: 10.1145/2818950.2818986

Near memory data structure rearrangement

Abstract: As CPU core counts continue to increase, the gap between compute power and available memory bandwidth has widened. A larger and deeper cache hierarchy benefits locality-friendly computation, but offers limited improvement to irregular, data-intensive applications. In this work we explore a novel approach to accelerating these applications through in-memory data restructuring. Unlike other proposed processing-in-memory architectures, the rearrangement hardware performs data reduction, not compute offload. Using …
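The following is a minimal host-side sketch in C, not taken from the paper: it only illustrates the kind of index-driven gather that the proposed rearrangement hardware would perform in the memory stack's logic layer, packing scattered elements into a dense view so the processor streams contiguous data instead of pulling a whole cache line per element. The function name gather_view and the use of plain CPU code are assumptions for illustration.

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical software model of the rearrangement step: elements selected
 * by an index list are packed into a dense buffer.  In a near-memory design
 * this gather would run in the logic layer, so only the packed view crosses
 * the memory bus and enters the cache hierarchy. */
static void gather_view(const double *src, const uint32_t *idx,
                        size_t n, double *packed)
{
    for (size_t i = 0; i < n; i++)
        packed[i] = src[idx[i]];        /* irregular reads stay near memory */
}

int main(void)
{
    double   src[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };
    uint32_t idx[3] = { 6, 1, 4 };      /* scattered access pattern */
    double   view[3];

    gather_view(src, idx, 3, view);     /* host then streams the dense view */
    printf("%.1f %.1f %.1f\n", view[0], view[1], view[2]);
    return 0;
}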

Cited by 30 publications (16 citation statements) | References 11 publications
“…Gokhale et al. [24] (2015) proposed to place a data rearrangement engine (DRE) in the logic layer of the HMC to accelerate data accesses while still performing the computation on the main CPU. The authors targeted cache-unfriendly applications with high memory latency due to irregular access patterns, e.g., sparse matrix multiplication.…”
Section: Re-configurable Unit
confidence: 99%
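As a concrete illustration of the irregular pattern such a DRE targets, here is a C sketch (not code from either paper) of a CSR sparse matrix-vector product: the dense vector is read through a column-index indirection, so consecutive iterations touch unrelated cache lines, and pre-gathering x[col[j]] into a packed stream near memory is exactly the kind of data reduction the engine is meant to provide. The helper spmv_csr and its signature are hypothetical.

#include <stddef.h>

/* Illustrative CSR sparse matrix-vector product (hypothetical helper).
 * The x[col[j]] indirect load is the cache-unfriendly access that a
 * near-memory rearrangement engine could serve as a pre-gathered stream. */
static void spmv_csr(size_t nrows, const size_t *row_ptr, const size_t *col,
                     const double *val, const double *x, double *y)
{
    for (size_t i = 0; i < nrows; i++) {
        double acc = 0.0;
        for (size_t j = row_ptr[i]; j < row_ptr[i + 1]; j++)
            acc += val[j] * x[col[j]]; /* indirect load: poor spatial locality */
        y[i] = acc;
    }
}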
“…First, NDP architectures typically do not have a shared level of cache memory [8, 19, 25, 38, 42-46, 49, 55, 67, 98, 110, 111, 113, 119, 155, 158], since the NDP-suited workloads usually do not benefit from deep cache hierarchies due to their poor locality [33, 43, 133, 143]. Second, NDP architectures do not typically support conventional hardware cache coherence protocols [8, 19, 25, 38, 42-45, 49, 55, 67, 82, 98, 111, 119, 155, 158], because they would add area and traffic overheads [46, 143], and would incur high complexity and latency [4], limiting the benefits of NDP. Third, communication across NDP units is expensive, because NDP systems are non-uniform distributed architectures.…”
Section: Memory Arrays
confidence: 99%
“…First, most NDP architectures [8, 19, 25, 38, 42-46, 49, 55, 67, 98, 110, 111, 113, 119, 155, 158] lack shared caches that can enable low-cost communication and synchronization among the NDP cores of the system. Second, hardware cache coherence protocols are typically not supported in NDP systems [8, 19, 25, 38, 42-45, 49, 55, 67, 82, 98, 111, 119, 155, 158], due to the high area and traffic overheads associated with such protocols [46, 143]. Third, NDP systems are non-uniform, distributed architectures, in which inter-unit communication is more expensive (in both performance and energy) than intra-unit communication [8, 20, 21, 38, 43, 83, 155, 158].…”
Section: Introduction
confidence: 99%
“…Other works couple GPU architectures with 3D-stacked memories [16], [17]. Still others utilize reconfigurable logic near the DRAM [18], [19], [20].…”
Section: Near Memory Processing (NMP)
confidence: 99%