2012 39th Annual International Symposium on Computer Architecture (ISCA)
DOI: 10.1109/isca.2012.6237043
Scale-out processors

Cited by 90 publications (36 citation statements)
References 17 publications
“…A number of studies have attempted to address the overheads of shared cache latency by advocating private cache implementations [9,10,30,12,28]. A number of these studies recognize that private caches can significantly degrade performance compared to shared caches.…”
Section: Related Work
confidence: 99%
“…The interconnect latency can be tackled by removing the interconnection network entirely and designing private LLCs or a hybrid of private and shared LLCs [9,10,30,12,28]. While hybrid LLCs provide the capacity benefits of a shared cache with the latency benefits of a private cache, they still suffer from the added L2 miss latency when the application working set is larger than the available L2 cache.…”
Section: Introduction
confidence: 99%
“…However, it is very unlikely that such an adversarial traffic pattern will occur in real workloads. In addition, a recent study on scale-out processors [29] showed that a hierarchical and modular memory hierarchy makes optimal use of die area. Recent processors, including the SPARC T4 [44] and AMD Bulldozer [3], also have hierarchical memory hierarchies.…”
Section: Worst-case Traffic Pattern Analysis
confidence: 99%
“…• A range of new methods to fairly compare the efficiency of server architectures (Section VI) and scale these architectures on demand to meet workload QoS requirements [6], [7]. NanoStreams advances the state of the art in micro-servers in several ways by: (a) adding application-specific but programmable hardware accelerators to micro-servers, as opposed to existing solutions that use elaborate hardware design flows and target a single algorithm [8]; (b) providing general-purpose low-latency networking to access accelerators in the datacentre, as opposed to custom fabrics [9]; (c) effectively integrating streaming and accelerator-aware programming models into domain-specific software stacks, moving one step ahead of ongoing efforts to unify heterogeneous programming models [10]; (d) significantly improving the energy efficiency of micro-servers via on-demand, QoS-aware scale-out and acceleration [6], [7].…”
Section: Introduction
confidence: 99%