2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA)
DOI: 10.1109/hpca51647.2021.00061
Prodigy: Improving the Memory Latency of Data-Indirect Irregular Workloads Using Hardware-Software Co-Design

Abstract: Irregular workloads are typically bottlenecked by the memory system. These workloads often use sparse data representations, e.g., compressed sparse row/column (CSR/CSC), to conserve space at the cost of complicated, irregular traversals. Such traversals access large volumes of data and offer little locality for caches and conventional prefetchers to exploit. This paper presents Prodigy, a low-cost hardware-software co-design solution for intelligent prefetching to improve the memory latency of several important…
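To make the access pattern concrete, below is a minimal sketch in C of the data-indirect traversal the abstract describes: a CSR graph walk in which each load depends on the value of a previous load, so stride-based caches and conventional prefetchers see little to exploit. All names here are illustrative assumptions, not taken from the paper.

/* Illustrative CSR graph layout (names are assumptions). */
#include <stdint.h>

typedef struct {
    uint32_t  num_vertices;
    uint32_t *row_offsets;  /* size num_vertices + 1: start of each edge list */
    uint32_t *col_indices;  /* size num_edges: neighbor vertex IDs */
    float    *vertex_data;  /* per-vertex property array */
} csr_graph_t;

/* Sum the data of u's neighbors. The load of vertex_data[v] is
 * indirect through col_indices, which is itself indexed through
 * row_offsets: a two-deep indirection chain with little locality. */
float sum_neighbor_data(const csr_graph_t *g, uint32_t u) {
    float acc = 0.0f;
    for (uint32_t e = g->row_offsets[u]; e < g->row_offsets[u + 1]; e++) {
        uint32_t v = g->col_indices[e];  /* irregular index */
        acc += g->vertex_data[v];        /* data-indirect load */
    }
    return acc;
}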

Cited by 35 publications (17 citation statements) | References 94 publications (125 reference statements)
“…Optimizing Irregular Memory Accesses Recent work has made significant strides in domain-agnostic prefetching for irregular applications [10,43,57]. Our split-tree structure can be seen as an application-specific prefetcher and achieves "perfect prefetching" in that 1) off-chip data accesses are overlapped with computation, 2) data needed by the accelerator are readily available on-chip without stalls, and 3) no redundant DRAM accesses are needed.…”
Section: Related Work (mentioning)
confidence: 99%
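As a rough illustration of the first property, overlapping off-chip accesses with computation, here is a hedged sketch of generic software prefetching applied to the CSR traversal above. It reuses the illustrative csr_graph_t, assumes a GCC/Clang-style __builtin_prefetch, and uses an arbitrary lookahead distance; it shows the general overlap technique, not the cited split-tree design.

#define PREFETCH_DISTANCE 8  /* illustrative lookahead, tune per machine */

float sum_with_prefetch(const csr_graph_t *g, uint32_t u) {
    uint32_t start = g->row_offsets[u];
    uint32_t end   = g->row_offsets[u + 1];
    float acc = 0.0f;
    for (uint32_t e = start; e < end; e++) {
        if (e + PREFETCH_DISTANCE < end) {
            /* issue the indirect prefetch early so DRAM latency is
             * hidden behind work on the current neighbors */
            __builtin_prefetch(&g->vertex_data[g->col_indices[e + PREFETCH_DISTANCE]]);
        }
        acc += g->vertex_data[g->col_indices[e]];
    }
    return acc;
}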
“…Prior work has studied many optimizations that transform data as it moves through the cache, e.g., to compress [9,36,90,106,107,118,136,146], decrypt [47,65,115], prefetch [6,131,149], change layout [7,23], memoize [8,40,153,154], or serialize/de-serialize [108] data. We motivate täkō by observing how its onMiss callback enables arbitrary data transformations while improving performance, saving energy, and reducing overall work.…”
Section: Example Program: Lossy Compression (mentioning)
confidence: 99%
“…Specialized cache hierarchies. These trends have been widely recognized, and there are many proposals to accelerate data movement, e.g., in machine learning [2,50], graph analytics [92,95,150], data structures [54,58,154], memoization [8,40,153,154], compression [9,36,90,106,107,118,136,146], data layout [7,23,155], prefetching [6,131,149], coherence and synchronization [34,75,151,152], memory management [85,135], and system software [67,108,127]. While highly effective, they share the drawback of requiring custom hardware.…”
Section: Related Work (mentioning)
confidence: 99%
“…Resolving load-dependent control flow operations at NDP precludes the need for using expensive branch resolution mechanisms on the CPU. Moreover, irregular accesses to graph data structures resulting in high memory latency and/or bandwidth [33,54] can be better serviced near memory at a low latency and high available bandwidth, addressing the two main bottlenecks in GPM workloads (Takeaway 3). In summary, NDP is an attractive candidate for accelerating GPM workloads.…”
Section: Why NDP for GPM? (mentioning)
confidence: 99%