Using SimPoint for accurate and efficient simulation

Perelman, Erez; Hamerly, Greg; Biesbrouck, Michael Van; Sherwood, Timothy; Calder, Brad

doi:10.1145/781064.781076

Cited by 25 publications

(27 citation statements)

References 0 publications

“…The detailed simulation parameters can be found in Table 2. We collect SimPoint [24] traces from 16 memory intensive SPEC CPU2006 [1] applications, 3 server workloads from CloudSuite [7], and one machine learning workload trace from mlpack [3] that does collaborative filtering on real world data sets [8]. Since our SimPoint methodology does not work with the server workloads (CloudSuite and mlpack), we instead collect the server workload traces after fast-forwarding at least 30B instructions to get past the benchmark's initialization phase.…”

Section: Methods (mentioning)

confidence: 99%

Kill the Program Counter

Kim, Teran, Gratz, et al. 2017

SIGARCH Comput. Archit. News

Data prefetching and cache replacement algorithms have been intensively studied in the design of high performance microprocessors. Typically, the data prefetcher operates in the private caches and does not interact with the replacement policy in the shared Last-Level Cache (LLC). Similarly, most replacement policies do not consider demand and prefetch requests as different types of requests. In particular, program counter (PC)-based replacement policies cannot learn from prefetch requests since the data prefetcher does not generate a PC value. PC-based policies can also be negatively affected by compiler optimizations. In this paper, we propose a holistic cache management technique called Kill-the-PC (KPC) that overcomes the weaknesses of traditional prefetching and replacement policy algorithms. KPC cache management has three novel contributions. First, a prefetcher which approximates the future use distance of prefetch requests based on its prediction confidence. Second, a simple replacement policy provides similar or better performance than current state-of-the-art PC-based prediction using global hysteresis. Third, KPC integrates prefetching and replacement policy into a whole system which is greater than the sum of its parts. Information from the prefetcher is used to improve the performance of the replacement policy and vice-versa. Finally, KPC removes the need to propagate the PC through entire on-chip cache hierarchy while providing a holistic cache management approach with better performance than state-of-the-art PC-, and non-PC-based schemes. Our evaluation shows that KPC provides 8% better

“…Sniper supports PinPoint (Patil et al 2004), which is the SimPoint methodology (Sherwood et al 2002;Perelman et al 2003) using the Intel Pin tool (Luk et al 2005). A single 250 million instruction PinPoint (Pinball), which is a representative and repeatable program region, is identified for each Spec2006 benchmark for simulation.…”

Section: Benchmarks (mentioning)

confidence: 99%

Cooperative Multi-Agent Reinforcement Learning-Based Co-optimization of Cores, Caches, and On-chip Network

Jain, Panda, Subramoney 2017

ACM Trans. Archit. Code Optim.

Modern multi-core systems provide huge computational capabilities, which can be used to run multiple processes concurrently. To achieve the best possible performance within limited power budgets, the various system resources need to be allocated effectively. Any mismatch between runtime resource requirement and allocation leads to a sub-optimal energy-delay product (EDP). Different optimization techniques exist for addressing the problem of mismatch between the dynamic requirement and runtime allocation of the system resources. Choosing between multiple optimizations at runtime is complex due to the non-additive effects, making the scenario suitable for the application of machine learning techniques. We present a novel method, Machine Learned Machines (MLM), by using online reinforcement learning (RL) to perform dynamic partitioning of the last level cache (LLC), along with dynamic voltage and frequency scaling (DVFS) of the core and uncore (interconnection network and LLC). We have proposed and evaluated three different MLM co-optimization techniques based on independent and cooperative multi-agent learners. We show that the co-optimization results in a much lower system EDP than any of the techniques applied individually. We explore various RL models targeted toward optimization of different system metrics and study their effects on a system EDP, system throughput (STP), and Fairness. The various proposed techniques have been extensively evaluated with a mix of 20 workloads on a 4-core system using Spec2006 benchmarks. We have further evaluated our cooperative MLM techniques on a 16-core system. The results show an average of 20.5% and 19.1% system EDP improvement on a 4-core and 16-core system, respectively, with limited degradation of STP and Fairness. Europe (DATE-2016) conference with the title "Machine Learned Machines: Adaptive Co-optimization of Caches, Cores, and On-chip Network" in March 2016. 
This extension has explored four additional co-optimization models. Two of the additional models are extensions of the DATE-2016 proposal, while two models are novel co-optimization models based on cooperative learning among the multiple agents. The two cooperative learner-based co-optimization techniques, coMLM and JMLM, are shown to scale well to higher core counts by evaluating it on a 16-core system, which exhibits 19.1% system EDP improvement.

INTRODUCTION

A multi-core system is expected to run different programs simultaneously with their own runtime resource requirements. These resource requirements vary as the program executes, and any mismatch in the resource requirement and allocation leads to sub-optimal performance and power. Various proposed optimization techniques address this problem of resource mismatch.

Multiprocessor systems-on-chip (MPSoC) have been witnessing increasing core counts to enable higher computational capabilities. The increasing core count leads to an increase in the shared resources such as Last Level Cache (LLC) and interconnects. The increasing hardware size and complexity poses mu...

“…Sampled Simulation. Sampled simulation is used extensively to reduce simulation time, e.g., SMARTS [14], or by the selection of representative samples, e.g., SimPoint needed [12,13,10]. The value of knowing which part of an application to simulate has even made its way into benchmarks, such as with PARSEC benchmark [2], where the applications indicate the regions of interest themselves.…”

Section: Related Work (mentioning)

confidence: 99%

Adaptive Cache Warming for Faster Simulations

Borgström, Sembrant, Black-Schaffer 2017

Proceedings of the 9th Workshop on Rapid Simulation and Performance Evaluation: Methods and Tools

The use of hardware-based virtualization allows modern simulators to very quickly fast-forward between sample points and regions of interest. This dramatically reduces the simulation time compared to traditional functional forwarding. However, as the fast-forwarding takes place through virtualized execution on the native hardware, it is unable to warm simulated structures, such as caches. As a result, sampled simulations taking advantage of virtualization for fast-forwarding find their execution time dominated by functional warming.

To address the cost of warming, we present Adaptive Cache Warming (ACW), a new fast method that determines how much warming each sample/phase/application needs. ACW takes advantage of the virtualization-based fast-forwarding to search for the minimum warming time required during simulation. To determine when the cache is sufficiently warm, ACW uses heuristics based on the last-level cache's cold-set misses.

Our results show that typical practice of conservatively warming last-level caches for around 100M instructions is a vast overkill for nearly all checkpoints. By using ACW, we can adapt the warming per-sample and speedup the simulation by 92-103× on average (512× speedup maximum) depending on cache size (2-32MB).
