Make the Most out of Last Level Cache in Intel Processors

Farshin, Alireza; Roozbeh, Amir; Maguire, Gerald Q.; Kostić, Dejan

doi:10.1145/3302424.3303977

Cited by 48 publications

(20 citation statements)

References 54 publications


“…CacheDirector [60] improves RPC latency by modifying DDIO to steer the header of each network packet into the LLC tile closest to the core that will process the packet. We go further by steering the whole packet all the way into the core's L1 cache.…”

Section: Related Work (mentioning)

confidence: 99%

The NEBULA RPC-Optimized Architecture

Sutherland, Gupta, Falsafi et al. 2020

2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA)


Large-scale online services are commonly structured as a network of software tiers, which communicate over the datacenter network using RPCs. Ongoing trends towards software decomposition have led to the prevalence of tiers receiving and generating RPCs with runtimes of only a few microseconds. With such small software runtimes, even the smallest latency overheads in RPC handling have a significant relative performance impact. In particular, we find that growing network bandwidth introduces queuing effects within a server's memory hierarchy, considerably hurting the response latency of fine-grained RPCs. In this work we introduce NEBULA, an architecture optimized to accelerate the most challenging microsecond-scale RPCs, by leveraging two novel mechanisms to drastically improve server throughput under strict tail latency goals. First, NEBULA reduces detrimental queuing at the memory controllers via hardware support for efficient in-LLC network buffer management. Second, NEBULA's network interface steers incoming RPCs into the CPU cores' L1 caches, improving RPC startup latency. Our evaluation shows that NEBULA boosts the throughput of a state-of-the-art key-value store by 1.25-2.19x compared to existing proposals, while maintaining strict tail latency goals.



“…The detail of the hash function is usually undocumented, while the mapping is known to be conducted by a calculation based on a particular part of the physical memory address of a data or a request. Thus several studies reverse-engineered the hash function of recent CPUs [29], [30]. Figure 7 shows an example of memory address mapping to LLC slices.…”

Section: Architecture of Last-Level Cache (mentioning)

confidence: 99%

“…Regarding this problem, LLC slices are also becoming one of the computing resources as well as the other resources such as the CPU cores and memory capacity. For the assignment of the LLC slices to each CPU core that runs a certain process, several slice-aware memory management technologies such as Intel Cache Allocation Technology (CAT) [28] are becoming popular in the operation of NFV infrastructure [29]. Figure 8 shows the proposed architecture.…”

Section: Architecture of Last-Level Cache (mentioning)

confidence: 99%

“…Thus, in the simulations, we assume that the processing times in L1/L2 caches follow exponential distributions. According to [29], the access time to LLC slice varies depending on the distance between a CPU core and an LLC slice. In the simulations, we assume that the processing time of LLC slices follows an exponential distribution.…”

Section: B System Assumption (mentioning)

confidence: 99%


Packet Processing Architecture Using Last-Level-Cache Slices and Interleaved 3D-Stacked DRAM

Korikawa, Kawabata et al. 2020

IEEE Access


The packet processing performance of a Network Function Virtualization (NFV)-aware environment depends on the memory access performance of commercial-off-the-shelf (COTS) hardware systems. Table lookup is a typical example of packet processing, which has a significant dependence on memory access performance. Thus, the on-chip cache memories of the CPU are becoming more and more critical for many high-performance software routers or switches. Moreover, in the carrier network, multiple applications run on top of the same hardware system in parallel, which requires the capacity of cache memories. In this paper, we propose a packet processing architecture that enhances memory access parallelism by combining on-chip last-level-cache (LLC) slices and off-chip interleaved three-dimensional (3D)-stacked Dynamic Random Access Memory (DRAM) devices. Table entries are stored in the off-chip 3D-stacked DRAM, so that memory requests are processed in parallel by using bank interleaving and channel parallelism. Also, cached entries are distributed to on-chip LLC slices according to a memory address-based hash function so that each CPU core can access on-chip LLC in parallel. The evaluation results show that the proposed architecture reduces the memory access latency by 62% and 12% and increases the throughput by 108% and 2%, while reducing the blocking probability of memory requests by 96% and 50%, compared to the architecture with on-chip shared LLC and that without on-chip LLC, respectively.
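The idea of distributing cache lines across LLC slices with an address-based hash can be illustrated with a toy model. The sketch below is not the hash used by any real processor (the citation statements above note that the real function is usually undocumented); it simply XOR-folds the cache-line address into a slice index. The slice count, the 64-byte line size, and the names `toy_slice_of` and `slice_histogram` are all assumptions made for illustration.

```python
from collections import Counter

CACHE_LINE_BITS = 6   # assumption: 64-byte cache lines
NUM_SLICES = 8        # hypothetical slice count (must be a power of two here)


def toy_slice_of(phys_addr: int) -> int:
    """Map a physical address to a slice index by XOR-folding its line address.

    Illustrative only: this is NOT the hash of any real CPU.
    """
    line = phys_addr >> CACHE_LINE_BITS      # drop the offset within the line
    bits = NUM_SLICES.bit_length() - 1       # index width in bits (3 for 8 slices)
    mask = NUM_SLICES - 1
    slice_id = 0
    while line:
        slice_id ^= line & mask              # fold the next group of bits in
        line >>= bits
    return slice_id


def slice_histogram(base: int, n_lines: int) -> Counter:
    """Count how many of n_lines consecutive cache lines land in each slice."""
    step = 1 << CACHE_LINE_BITS
    return Counter(toy_slice_of(base + i * step) for i in range(n_lines))
```

Under this toy hash, all bytes of one cache line map to the same slice, and consecutive lines spread over the slices, which is the property that lets several cores access different slices in parallel.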


“…RSS++ could also exploit Non-Uniform Cache Access (NUCA) awareness. Our algorithm could be augmented using the technique of [11] to re-assign the buckets of each overloaded core, first to a collocated hardware-thread and then assigned based upon the transfer times between cores.…”

Section: NUMA and NUCA Awareness (mentioning)

confidence: 99%

RSS++

Barbette, Katsikas, Maguire et al. 2019

Proceedings of the 15th International Conference on Emerging Networking Experiments and Technologies

Self Cite


While the current literature typically focuses on load-balancing among multiple servers, in this paper, we demonstrate the importance of load-balancing within a single machine (potentially with hundreds of CPU cores). In this context, we propose a new load-balancing technique (RSS++) that dynamically modifies the receive side scaling (RSS) indirection table to spread the load across the CPU cores in a more optimal way. RSS++ incurs up to 14x lower 95th-percentile tail latency and orders of magnitude fewer packet drops compared to RSS under high CPU utilization. RSS++ allows higher CPU utilization and dynamic scaling of the number of allocated CPU cores to accommodate the input load while avoiding the typical 25% over-provisioning. RSS++ has been implemented for both (i) DPDK and (ii) the Linux kernel. Additionally, we implement a new state migration technique which facilitates sharding and reduces contention between CPU cores accessing per-flow data. RSS++ keeps the flow state in groups that can be migrated at once, leading to a 20% higher efficiency than a state-of-the-art shared flow table.
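As a rough illustration of what modifying an RSS indirection table means, the sketch below models the table as a list that maps each hash bucket to a core, then greedily moves buckets from the most loaded core to the least loaded one. This is a simplified heuristic written for this summary, not the RSS++ algorithm itself; the function names (`build_table`, `core_load`, `rebalance`) and the per-bucket load figures are hypothetical.

```python
def build_table(num_buckets: int, num_cores: int) -> list[int]:
    """Default indirection table: buckets assigned to cores round-robin."""
    return [bucket % num_cores for bucket in range(num_buckets)]


def core_load(table: list[int], bucket_load: list[int], num_cores: int) -> list[int]:
    """Sum the measured load of every bucket currently mapped to each core."""
    load = [0] * num_cores
    for bucket, core in enumerate(table):
        load[core] += bucket_load[bucket]
    return load


def rebalance(table: list[int], bucket_load: list[int], num_cores: int) -> list[int]:
    """Greedy sketch (an assumption, not RSS++): repeatedly move one bucket
    from the most loaded core to the least loaded core while a move narrows
    the gap between them. Returns a new table; the input is left untouched."""
    table = list(table)
    while True:
        load = core_load(table, bucket_load, num_cores)
        src = max(range(num_cores), key=load.__getitem__)
        dst = min(range(num_cores), key=load.__getitem__)
        gap = load[src] - load[dst]
        # Only buckets lighter than the gap can narrow it when moved.
        candidates = [b for b, c in enumerate(table)
                      if c == src and 0 < bucket_load[b] < gap]
        if not candidates:
            return table
        # Prefer the bucket whose load is closest to half of the gap.
        best = min(candidates, key=lambda b: abs(bucket_load[b] - gap / 2))
        table[best] = dst
```

For example, with eight buckets on two cores and loads `[10, 1, 10, 1, 10, 1, 10, 1]`, the round-robin table puts all heavy buckets on core 0; calling `rebalance` on it returns a table whose per-core loads are equal.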

