2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA)
DOI: 10.1109/hpca53966.2022.00032
Near-Stream Computing: General and Transparent Near-Cache Acceleration

Cited by 8 publications (7 citation statements)
References 74 publications
“…The second use case that was used to further corroborate the functionality of the proposed NDPmulator framework was the NDAcc presented by Wang et al [47], [48]. Their architecture consists of Processing Element (PE) arrays installed close to the cache to perform arithmetic and logic vector operations, as depicted in Fig.…”
Section: B. Near-Stream Computing
confidence: 98%
“…Section III explains the simulation flow of NDPmulator for both SE and FS modes, providing examples of its operation. Section IV briefly describes the NDAccs proposed by Das and Kapoor [46], Wang et al [47], [48], and Genc et al [49], whose architectures were used to validate NDPmulator, and presents and discusses the obtained experimental results. Section V summarizes relevant related work.…”
Section: All in All, This Paper Presents the Following Contributions
confidence: 99%
“…A dataflow fabric can easily run callbacks in parallel by assigning each a unique tag. Alternatively, täkō could execute callbacks on reserved SMT threads [141,151], but this would either sequentialize callbacks or require multiple, heavy-weight thread contexts. Moreover, constantly re-fetching and decoding the same instructions would be wasteful.…”
Section: Engine Microarchitecture
confidence: 99%
“…TaskStream [15] extends the ISA for task parallelism, enabling dynamic reordering of tasks to exploit opportunities for multicasting data shared between tasks. Prior work also adds stream abstractions to CPU ISAs [57][58][59]; the "stream confluence" optimization [59] enables recognizing simultaneous reuse across multiple cores and combines streams dynamically to reduce requests to shared cache and reduce traffic by multicasting. Overall, the realization of Mozart lends credence to the practicality of adopting these ideas in industry.…”
Section: Related Work
confidence: 99%