2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA)
DOI: 10.1109/hpca.2015.7056045

High performing cache hierarchies for server workloads: Relaxing inclusion to capture the latency benefits of exclusive caches

Abstract: Increasing transistor density enables adding more on-die cache real estate. However, devoting more space to the shared last-level cache (LLC) causes the memory latency bottleneck to move from memory access latency to shared cache access latency. As such, applications whose working set is larger than the smaller caches spend a large fraction of their execution time on shared cache access latency. To address this problem, this paper investigates increasing the size of smaller private caches in the hierarchy as op…
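The trade-off the abstract describes can be sketched with a simple average-memory-access-time (AMAT) model: in an inclusive hierarchy, private-cache contents are duplicated in the LLC, so the same SRAM budget buys a smaller effective private cache than in an exclusive hierarchy. The latencies and hit rates below are illustrative assumptions for the sketch, not figures from the paper.

```python
# Minimal AMAT sketch contrasting an inclusive hierarchy (small L2,
# data duplicated in the LLC) with an exclusive one (larger L2 funded
# by the same total SRAM budget). All numbers are hypothetical.

def amat(l2_hit_rate, llc_hit_rate, l2_lat, llc_lat, mem_lat):
    """AMAT = L2 latency + miss fraction * (LLC latency + miss fraction * memory latency)."""
    return l2_lat + (1 - l2_hit_rate) * (llc_lat + (1 - llc_hit_rate) * mem_lat)

# Inclusive: duplication limits effective capacity; the small L2 hits less often.
inclusive = amat(l2_hit_rate=0.60, llc_hit_rate=0.80,
                 l2_lat=12, llc_lat=40, mem_lat=200)

# Exclusive: no duplication, so a larger L2 serves more accesses at private-cache latency.
exclusive = amat(l2_hit_rate=0.80, llc_hit_rate=0.80,
                 l2_lat=14, llc_lat=40, mem_lat=200)

print(f"inclusive AMAT = {inclusive:.1f} cycles")  # 44.0
print(f"exclusive AMAT = {exclusive:.1f} cycles")  # 30.0
```

Under these assumed parameters the exclusive design wins because a larger fraction of accesses is served at the fast private-cache latency, which is the latency benefit the paper's title refers to.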

Cited by 30 publications (18 citation statements)
References 29 publications (61 reference statements)
“…On the other hand, the Skylake architecture has support for AVX-512 instructions, more parallel cores, and larger L2 caches. Furthermore, Haswell and Broadwell implement an inclusive L2/L3 cache hierarchy, while Skylake implements a non-inclusive/exclusive cache hierarchy [34,35]. (For the remainder of this paper we will refer to Skylake's L2/L3 cache hierarchy as exclusive.)…”
Section: Machines
confidence: 99%
“…Xiao et al [24] presented a dual queues cache replacement algorithm based on sequentiality detection to improve the cache design. Jaleel et al [23] presented directions for further research to maximize performance of exclusive cache hierarchies. Chou et al [35] proposed CAMEO which not only makes stacked DRAM visible as part of the memory address space but also exploits data locality.…”
Section: Cache Architecture Design
confidence: 99%
“…Most recent research on hybrid SRAM and DRAM caches focuses mainly on enhancing the overall performance of SRAM (resp., DRAM) by utilizing the merits of DRAM (resp., SRAM). There are also many papers devoted to investigating workload performance: (1) for multi-programmed workloads, prior work discussed the issues of relieving memory contention [10,11], workload balance [12,13], and power-related optimization [14]; (2) to improve the performance of memory-intensive workloads, many solutions (e.g., architecture design [15][16][17], OS-level methods [18][19][20], and feedback control [21,22]) have also been proposed; (3) in the cache system, improved cache architectures [4,9,23,24] and 3D-stacked DRAM technologies [25][26][27] are used to achieve better workload performance; and so on (a broader overview of related work is covered in Section 2). In contrast, little attention has been paid to designing a last-level cache (LLC) scheduling scheme for multi-programmed workloads with different memory footprints.…”
Section: Introduction
confidence: 99%
“…Our work focuses on mitigating the power dissipation caused by the following cache coherence problems: a) Non-sequential data fetch: the cache prefetcher fetches data in a sequential manner, so randomly allocated data causes more cache misses. One way to ensure sequential data fetching is to redesign the cache hierarchy [10]. However, it is difficult to keep data allocation in sequence in an SMT CMP architecture, where the context switches regularly and memory is allocated randomly.…”
Section: A Cache Coherence In Multithreading
confidence: 99%