Increasing transistor density enables adding more on-die cache real estate. However, devoting more space to the shared last-level cache (LLC) shifts the memory latency bottleneck from memory access latency to shared cache access latency. As such, applications whose working set is larger than the smaller caches spend a large fraction of their execution time on shared cache access latency. To address this problem, this paper investigates increasing the size of the smaller private caches in the hierarchy rather than increasing the shared LLC. Doing so improves average cache access latency for workloads whose working set fits into the larger private cache while retaining the benefits of a shared LLC. Increasing the size of the private caches requires relaxing inclusion and building an exclusive hierarchy. Thus, for the same total caching capacity, an exclusive cache hierarchy provides better cache access latency. We observe that server workloads benefit tremendously from an exclusive hierarchy with large private caches, primarily because large private caches accommodate the large code working sets of server workloads. For a 16-core CMP, an exclusive cache hierarchy improves server workload performance by 5-12% compared to an equal-capacity inclusive cache hierarchy. The paper also presents directions for further research to maximize the performance of exclusive cache hierarchies.
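The latency trade-off the abstract describes can be illustrated with a simple average-memory-access-time (AMAT) model. The sketch below is not from the paper; all latencies and hit rates are hypothetical values chosen only to show why a larger private cache that captures more of the working set can lower average access latency even when total capacity is unchanged.

```python
# Hedged sketch: a toy AMAT model contrasting an inclusive hierarchy
# (small private L2, large shared LLC) with an exclusive hierarchy
# (large private L2, same total capacity). All numbers are hypothetical.

def amat(levels, mem_latency):
    """levels: list of (local_hit_rate, access_latency) from innermost
    to outermost cache level. Each hit rate is the fraction of accesses
    reaching that level that hit there."""
    total, reach = 0.0, 1.0
    for hit_rate, latency in levels:
        total += reach * latency       # accesses reaching this level pay its latency
        reach *= (1.0 - hit_rate)      # misses continue to the next level
    return total + reach * mem_latency # remaining misses go to main memory

# Inclusive: small private L2 (low hit rate), slow shared LLC.
inclusive = amat([(0.95, 4), (0.60, 12), (0.50, 40)], mem_latency=200)

# Exclusive: larger private L2 captures more of the working set,
# so fewer accesses pay the shared-LLC access latency.
exclusive = amat([(0.95, 4), (0.80, 14), (0.50, 40)], mem_latency=200)

print(f"inclusive AMAT: {inclusive:.1f} cycles")  # 7.4 with these numbers
print(f"exclusive AMAT: {exclusive:.1f} cycles")  # 6.1 with these numbers
```

Under these assumed parameters the exclusive configuration wins because the extra private-cache hits avoid the longer shared-LLC access path, which is the effect the paper measures on server workloads.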
TODAY, THERE ARE MANY competing ideas about how to implement multiprocessor systems. Although some of these ideas have been prototyped in hardware, hardware prototypes take too long to build and are very expensive. Often, by the time a hardware prototype really works, it is obsolete. First, the prototype's absolute speed is no longer on a par with current hardware. Second, the technology trade-offs among components change, so that performance results obtained on the prototype become meaningless. Third, the new architectural ideas embodied in the prototype may become irrelevant. Moreover, hardware prototypes are often hard to observe. By contrast, software simulations are very flexible, observable, and relatively inexpensive to develop. However, software simulations often force a trade-off between speed and realism. Hardware emulation using FPGAs (field-programmable gate arrays) [1] is an intermediate approach between software simulation and hardware prototyping. We adopted this approach in a multiprocessor emulator called RPM (Rapid Prototyping Engine for Multiprocessor Systems). Because of its flexibility, the RPM hardware can adapt during its lifetime to the rapid evolution of technology trade-offs and new architectural ideas. RPM is also much more observable than typical hardware prototypes. RPM-2, the second RPM implementation, is up and running. Our first RPM-2 prototype is a cache-coherent nonuniform memory-access (CC-NUMA) multiprocessor [2].