2012 International Symposium on System on Chip (SoC)
DOI: 10.1109/issoc.2012.6376362

A multi-banked shared-L1 cache architecture for tightly coupled processor clusters

Abstract: A shared-L1 cache architecture is proposed for tightly coupled processor clusters. Sharing an L1 tightly coupled data memory (TCDM) among a significant number of processors (up to 16) is challenging in terms of speed. Sharing an L1 cache is even more challenging, since its operation is more complex, although it eases programming. The feasibility in terms of performance of a shared TCDM was shown in the STMicroelectronics Platform 2012, but the performance cost of supporting a shared L1 cache remains to be proven. In this paper we …
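The abstract describes a multi-banked shared-L1 memory serving up to 16 processors. The paper's own design details are truncated above, so the following is only an illustrative sketch of the word-level bank interleaving commonly used in such multi-banked shared memories; the constants and function names are assumptions, not taken from the paper.

```python
# Illustrative sketch (assumed parameters, not from the paper):
# low-order word interleaving for a multi-banked shared-L1 memory.

WORD_BYTES = 4   # assumed 32-bit words
NUM_BANKS = 16   # assumed one bank per core in a 16-core cluster

def bank_of(addr: int, num_banks: int = NUM_BANKS) -> int:
    """Low-order interleaving: consecutive words map to consecutive
    banks, spreading sequential accesses across all banks."""
    return (addr // WORD_BYTES) % num_banks

def bank_offset(addr: int, num_banks: int = NUM_BANKS) -> int:
    """Word index inside the selected bank."""
    return addr // (WORD_BYTES * num_banks)

# Sixteen consecutive word addresses land in sixteen distinct banks,
# so sequential accesses by the cores do not conflict on any one bank.
addrs = [i * WORD_BYTES for i in range(NUM_BANKS)]
print([bank_of(a) for a in addrs])
```

Low-order interleaving is what makes the "best-case read latency of one clock cycle" cited below plausible: when concurrent requests select different banks, they can all be served in the same cycle without arbitration stalls.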

Cited by 7 publications (3 citation statements)
References 7 publications
“…On and Hussin [19] analysed the impact that different many-core clustering methods have on multiprocessing architectures. To improve performance, Kakoee et al [20] proposed a shared-L1 cache architecture for tightly coupled processor clusters. These works demonstrate that memory access latencies differ strongly in such architectures, depending on the data locality on the clusters.…”
Section: Literature Review
confidence: 99%
“…The work of Rahimi et al is extended in [19] by a controllable pipeline stage between the CPUs and memory banks to be more reliable and variation-tolerant. In [10] a shared L1 data cache is presented. Using the logarithmic interconnect network proposed by Rahimi et al, the best-case read latency is one clock cycle.…”
Section: Related Work
confidence: 99%
“…Streaming applications are characterized by continuous processing of a data stream via many different tasks. Due to the static data-flow between these tasks, the CoreVA-MPSoC uses software-managed scratchpad memories instead of caches, as they are used in [7], [10], [13], [16], and [17]. In contrast to the Epiphany [8], our CoreVA-MPSoC features a hierarchical communication infrastructure.…”
Section: Related Work
confidence: 99%