Abstract: This is the accepted version of a paper published in IEEE Communications Surveys and Tutorials. This paper has been peer-reviewed but does not include the final publisher proof corrections or journal pagination.
“…Moreover, Intel and other vendors might consider introducing a new processor mode in which the hash function is known, the granularity of chunks is increased (e.g., to 4 kB pages), or the function is even programmable. Considering the need for hardware changes in the future data centers [57], we hope this paper will encourage hardware vendors to adopt one or more of these alternatives.…”
In modern (Intel) processors, the Last Level Cache (LLC) is divided into multiple slices, and an undocumented hashing algorithm (aka Complex Addressing) maps different parts of the memory address space among these slices to increase the effective memory bandwidth. After a careful study of Intel's Complex Addressing, we introduce a slice-aware memory management scheme, wherein frequently used data can be accessed faster via the LLC. Using our proposed scheme, we show that a key-value store can potentially improve its average performance by ∼12.2% and ∼11.4% for 100% and 95% GET workloads, respectively. Furthermore, we propose CacheDirector, a network I/O solution which extends Direct Data I/O (DDIO) and places the packet's header in the slice of the LLC that is closest to the relevant processing core. We implemented CacheDirector as an extension to DPDK and evaluated our proposed solution for latency-critical applications in Network Function Virtualization (NFV) systems. Evaluation results show that CacheDirector makes packet processing faster by reducing tail latencies (90-99th percentiles) by up to 119 µs (∼21.5%) for optimized NFV service chains running at 100 Gbps. Finally, we analyze the effectiveness of slice-aware memory management in realizing cache isolation.
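Slice-aware memory management hinges on knowing which LLC slice a given physical address maps to. Prior reverse-engineering work has modeled Intel's Complex Addressing as XOR (parity) functions over selected physical-address bits; the sketch below illustrates that model only. The bit masks are hypothetical placeholders, not real per-CPU values: the actual functions are undocumented and must be measured per processor model.

```python
def parity(x: int) -> int:
    """Return the XOR (parity) of all set bits in x."""
    p = 0
    while x:
        p ^= 1
        x &= x - 1  # clear the lowest set bit
    return p

# Hypothetical masks for illustration: each mask selects the
# physical-address bits XORed together to produce one bit of the
# slice index (here, 4 slices -> 2 slice-index bits).
HYPOTHETICAL_MASKS = [0x1B5F575440, 0x2EB5FAA880]

def slice_of(phys_addr: int) -> int:
    """Model of Complex Addressing: slice index from address-bit parities."""
    return sum(parity(phys_addr & m) << i
               for i, m in enumerate(HYPOTHETICAL_MASKS))

def pick_buffers(addrs, target_slice):
    """Slice-aware allocation: keep only buffers mapping to one slice."""
    return [a for a in addrs if slice_of(a) == target_slice]
```

Under this model, a slice-aware allocator over-allocates buffers, computes each buffer's slice, and retains only those landing in the slice closest to the consuming core, which is the general strategy CacheDirector applies to packet headers.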
“…It is important for the transmission links and the optical interconnects to meet the requirements in terms of latency and bandwidth. Table 1 lists the latency and bandwidth requirements of three major types of resource interconnects in a modern integrated server [3] [6]. It can be seen that for storage- and NIC-related communications, the latency requirement is on the scale of microseconds (or even longer), and the bandwidth requirement is less than or equal to 10 Gb/s.…”
Section: Network Requirements Of Communication Between Resources
mentioning
confidence: 99%
“…Each integrated server has a fixed amount of resources (e.g., an HP ProLiant BL660c Gen8 blade server with an 8-core CPU, 16 GB memory, a 600 GB hard drive, and a 1 Gb/s Ethernet NIC). Such a static hardware configuration leads to 'resource stranding' [3], i.e., a server that has used up one type of resource cannot take on more workload even though a large amount of its other resource types is left over. For example, a compute-intensive task like video processing may consume all the CPU resources in a server, while the memory in the same server cannot be assigned to other tasks.…”
mentioning
confidence: 99%
“…Instead, the DC operator has to discard old servers and buy new ones. This may incur high maintenance and upgrade costs, and also postpone the adoption of new-generation hardware [3].…”
Resource utilization in modern data centers is significantly limited by the mismatch between the diverse resources required by running applications and the fixed amount of hardwired resources (e.g., the number of central processing unit (CPU) cores or the size of memory) in the server blades. In this regard, the concept of function disaggregation has been introduced, where the integrated server blades containing all types of resources are replaced by resource blades, each providing only one specific function. Disaggregated data centers can therefore offer high flexibility for resource allocation, and hence their resource utilization can be largely improved. In addition, function disaggregation simplifies system upgrades, allowing quick adoption of new-generation components in data centers. However, the communication between different resources faces severe challenges in terms of the required latency and transmission bandwidth. In particular, the CPU-memory interconnects in fully disaggregated data centers require ultra-low latency and ultra-high transmission bandwidth in order to prevent performance degradation for running applications. Optical fiber communication is a promising technique to offer high capacity and low latency, but it is still very challenging for state-of-the-art optical transmission technologies to meet the requirements of fully disaggregated data centers. In this paper, different levels of function disaggregation are investigated. For fully disaggregated data centers, two architectural options are presented, in which optical interconnects are necessary for CPU-memory communications. We review the state-of-the-art optical transmission technologies and assess their performance when employed to support function disaggregation in data centers.
The results reveal that function disaggregation does improve the efficiency of resource usage in data centers, although the bandwidth provided by state-of-the-art optical transmission technologies is not always sufficient for fully disaggregated data centers. This calls for further research in optical transmission to fully exploit the advantages of function disaggregation in data centers.

I. INTRODUCTION

Cloud computing is one of the major services provided by modern data centers (DCs), where users are able to freely choose resources and operating systems (OSs) for running their applications without considering the underlying
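The resource-stranding effect that motivates disaggregation can be made concrete with a toy placement experiment (all server sizes and task demands below are made-up illustrative numbers): with integrated servers, a task can be rejected even though the aggregate pool holds enough of every resource, whereas a disaggregated pool admits it.

```python
# Toy comparison: integrated servers vs. a disaggregated resource pool.
SERVERS = 2
CPU_PER_SERVER, MEM_PER_SERVER = 8, 16  # cores, GB (hypothetical)

# (cores, GB) demands; two CPU-heavy tasks strand memory on both servers.
TASKS = [(7, 2), (7, 2), (2, 14)]

def place_integrated(tasks):
    """First-fit placement on fixed-size servers; returns tasks placed."""
    free = [[CPU_PER_SERVER, MEM_PER_SERVER] for _ in range(SERVERS)]
    placed = 0
    for cpu, mem in tasks:
        for srv in free:
            if srv[0] >= cpu and srv[1] >= mem:
                srv[0] -= cpu
                srv[1] -= mem
                placed += 1
                break
    return placed

def place_disaggregated(tasks):
    """Placement against pooled totals; returns tasks placed."""
    cpu_pool = SERVERS * CPU_PER_SERVER
    mem_pool = SERVERS * MEM_PER_SERVER
    placed = 0
    for cpu, mem in tasks:
        if cpu_pool >= cpu and mem_pool >= mem:
            cpu_pool -= cpu
            mem_pool -= mem
            placed += 1
    return placed
```

Here the integrated layout strands 14 GB of memory on each server behind exhausted CPUs and rejects the third task, while the pooled layout places all three.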
“…Given that cloud operators bill per-vCPU, scaling applications at a fine granularity of time reduces costs or at least allows better collocation. For instance, performing balancing at high speed allows Software Defined "Hardware" Infrastructure (SDHI) [53] to quickly free available cores. Once the utilization of a set of machines has decreased beyond a certain point, one may even begin deactivating some of these machines.…”
While the current literature typically focuses on load-balancing among multiple servers, in this paper we demonstrate the importance of load-balancing within a single machine (potentially with hundreds of CPU cores). In this context, we propose a new load-balancing technique (RSS++) that dynamically modifies the receive side scaling (RSS) indirection table to spread the load across the CPU cores more optimally. RSS++ incurs up to 14x lower 95th percentile tail latency and orders of magnitude fewer packet drops compared to RSS under high CPU utilization. RSS++ allows higher CPU utilization and dynamic scaling of the number of allocated CPU cores to accommodate the input load while avoiding the typical 25% over-provisioning. RSS++ has been implemented for both (i) DPDK and (ii) the Linux kernel. Additionally, we implement a new state migration technique which facilitates sharding and reduces contention between CPU cores accessing per-flow data. RSS++ keeps flow state in groups that can be migrated at once, leading to a 20% higher efficiency than a state-of-the-art shared flow table.
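The mechanism behind this abstract is that RSS hashes each flow into one of a few hundred indirection-table buckets, and each bucket points at a core; rewriting a bucket's entry migrates all of its flows at once. The following is a simplified greedy rebalancer in that spirit, not the paper's actual algorithm, and it assumes per-bucket load counters are available (a hypothetical input):

```python
def rebalance(bucket_load, bucket_to_core, num_cores):
    """Greedily move indirection-table buckets from the most-loaded
    core to the least-loaded one while that reduces the imbalance.
    Returns a new bucket -> core mapping."""
    core_load = [0.0] * num_cores
    for b, c in enumerate(bucket_to_core):
        core_load[c] += bucket_load[b]

    new_map = list(bucket_to_core)
    for _ in range(len(bucket_load)):  # at most one move per bucket
        hot = max(range(num_cores), key=core_load.__getitem__)
        cold = min(range(num_cores), key=core_load.__getitem__)
        candidates = [b for b in range(len(new_map)) if new_map[b] == hot]
        if not candidates:
            break
        b = min(candidates, key=bucket_load.__getitem__)  # lightest bucket
        if core_load[hot] - core_load[cold] <= bucket_load[b]:
            break  # moving this bucket would not reduce the imbalance
        core_load[hot] -= bucket_load[b]
        core_load[cold] += bucket_load[b]
        new_map[b] = cold
    return new_map
```

For example, with five buckets of load [10, 1, 1, 1, 1] all pinned to core 0 and two cores available, the rebalancer moves the four light buckets to core 1 and stops, since migrating the heavy bucket would only invert the imbalance. In RSS++ the resulting mapping would then be written to the NIC (e.g., via DPDK's RETA update API or `ethtool -X` on Linux), together with migration of the affected buckets' flow-state groups.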