Emerging datacenter applications operate on vast datasets that are kept in DRAM to minimize latency. The large number of servers needed to accommodate this massive memory footprint requires frequent server-to-server communication in applications such as key-value stores and graph-based applications that rely on large irregular data structures. The fine-grained nature of the accesses is a poor match to commodity networking technologies, including RDMA, which incur delays of 10-1000x over local DRAM operations.We introduce Scale-Out NUMA (soNUMA) -an architecture, programming model, and communication protocol for low-latency, distributed in-memory processing. soNUMA layers an RDMA-inspired programming model directly on top of a NUMA memory fabric via a stateless messaging protocol. To facilitate interactions between the application, OS, and the fabric, soNUMA relies on the remote memory controller -a new architecturally-exposed hardware block integrated into the node's local coherence hierarchy. Our results based on cycle-accurate full-system simulation show that soNUMA performs remote reads at latencies that are within 4x of local DRAM, can fully utilize the available memory bandwidth, and can issue up to 10M remote memory operations per second per core.
Datacenter operators have started deploying Persistent Memory (PM), leveraging its combination of fast access and persistence for significant performance gains. A key challenge for PM-aware software is to maintain high performance while achieving atomic durability. The latter typically requires the use of logging, which introduces considerable overhead with additional CPU cycles, write traffic, and ordering requirements. In this paper, we exploit the data multiversioning inherent in the memory hierarchy to achieve atomic durability without logging. Our design, LAD, relies on persistent buffering space at the memory controllers (MCs)-already present in modern CPUs-to speculatively accumulate all of a transaction's updates before they are all atomically committed to PM. LAD employs an on-chip distributed commit protocol in hardware to manage the distributed speculative state each transaction accumulates across multiple MCs. We demonstrate that LAD is a practical design relying on modest hardware modifications to provide atomically durable transactions, while delivering up to 80% of ideal-i.e., PM-oblivious software's-performance. CCS CONCEPTS • Computer systems organization → Processors and memory architectures; • Information systems → Storage class memory; Phase change memory.
Datacenter operators rely on low-cost, high-density technologies to maximize throughput for data-intensive services with tight tail latencies. In-memory rack-scale computing is emerging as a promising paradigm in scale-out datacenters capitalizing on commodity SoCs, low-latency and highbandwidth communication fabrics and a remote memory access model to enable aggregation of a rack's memory for critical data-intensive applications such as graph processing or key-value stores. Low latency and high bandwidth not only dictate eliminating communication bottlenecks in the software protocols and off-chip fabrics but also a careful on-chip integration of network interfaces. The latter is a key challenge especially in architectures with RDMA-inspired one-sided operations that aim to achieve low latency and high bandwidth through on-chip Network Interface (NI) support. This paper proposes and evaluates network interface architectures for tiled manycore SoCs for in-memory rack-scale computing. Our results indicate that a careful splitting of NI functionality per chip tile and at the chip's edge along a NOC dimension enables a rack-scale architecture to optimize for both latency and bandwidth. Our best manycore NI architecture achieves latencies within 3% of an idealized hardware NUMA and efficiently uses the full bisection bandwidth of the NOC, without changing the on-chip coherence protocol or the core's microarchitecture.
With mainstream technologies to couple logic tightly with memory on the horizon, near-memory processing has re-emerged as a promising approach to improving performance and energy for data-centric computing. DRAM, however, is primarily designed for density and low cost, with a rigid internal organization that favors coarse-grain streaming rather than byte-level random access. This paper makes the case that treating DRAM as a block-oriented streaming device yields significant efficiency and performance benefits, which motivate for algorithm/architecture co-design to favor streaming access patterns, even at the price of a higher order algorithmic complexity. We present the Mondrian Data Engine that drastically improves the runtime and energy efficiency of basic in-memory analytic operators, despite doing more work as compared to traditional CPU-optimized algorithms, which heavily rely on random accesses and deep cache hierarchies
Modern in-memory services rely on large distributed object stores to achieve the high scalability essential to service thousands of requests concurrently. The independent and unpredictable nature of incoming requests results in random accesses to the object store, triggering frequent remote memory accesses. State-of-the-art distributed memory frameworks leverage the one-sided operations offered by RDMA technology to mitigate the traditionally high cost of remote memory access. Unfortunately, the limited semantics of RDMA onesided operations bound remote memory access atomicity to a single cache block; therefore, atomic remote object access relies on software mechanisms. Emerging highly integrated rackscale systems that reduce the latency of one-sided operations to a small multiple of DRAM latency expose the overhead of these software mechanisms as a major latency contributor. This technology-triggered paradigm shift calls for new onesided operations with stronger semantics. We take a step in that direction by proposing SABRes, a new one-sided operation that provides atomic remote object reads in hardware. We then present LightSABRes, a lightweight hardware accelerator for SABRes that removes all atomicity-associated software overheads. Compared to a state-of-the-art software atomicity mechanism, LightSABRes improve the throughput of a microbenchmark atomically accessing 128B-8KB objects from remote memory by 15-97%, and the throughput of a modern in-memory distributed object store by 30-60%.
To provide low latency and high throughput guarantees, most large key-value stores keep the data in the memory of many servers. Despite the natural parallelism across lookups, the load imbalance, introduced by heavy skew in the popularity distribution of keys, limits performance. To avoid violating tail latency service-level objectives, systems tend to keep server utilization low and organize the data in micro-shards, which provides units of migration and replication for the purpose of load balancing. These techniques reduce the skew, but incur additional monitoring, data replication and consistency maintenance overheads.In this work, we introduce RackOut, a memory pooling technique that leverages the one-sided remote read primitive of emerging rack-scale systems to mitigate load imbalance while respecting service-level objectives. In RackOut, the data is aggregated at rack-scale granularity, with all of the participating servers in the rack jointly servicing all of the rack's micro-shards. We develop a queuing model to evaluate the impact of RackOut at the datacenter scale. In addition, we implement a RackOut proof-of-concept key-value store, evaluate it on two experimental platforms based on RDMA and Scale-Out NUMA, and use these results to validate the model. Our results show that RackOut can increase throughput up to 6× for RDMA and 8.6× for Scale-Out NUMA compared to a scale-out deployment, while respecting tight tail latency service-level objectives.
Modern online services come with stringent quality requirements in terms of response time tail latency. Because of their decomposition into fine-grained communicating software layers, a single user request fans out into a plethora of short, µs-scale RPCs, aggravating the need for faster inter-server communication. In reaction to that need, we are witnessing a technological transition characterized by the emergence of hardware-terminated user-level protocols (e.g., In-finiBand/RDMA) and new architectures with fully integrated Network Interfaces (NIs). Such architectures offer a unique opportunity for a new NI-driven approach to balancing RPCs among the cores of manycore server CPUs, yielding major tail latency improvements for µs-scale RPCs. We introduce RPCValet, an NI-driven RPC load-balancing design for architectures with hardware-terminated protocols and integrated NIs, that delivers near-optimal tail latency. RPCValet's RPC dispatch decisions emulate the theoretically optimal single-queue system, without incurring synchronization overheads currently associated with single-queue implementations. Our design improves throughput under tight tail latency goals by up to 1.4×, and reduces tail latency before saturation by up to 4× for RPCs with µs-scale service times, as compared to current systems with hardware support for RPC load distribution. RPCValet performs within 15% of the theoretically optimal single-queue system.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
334 Leonard St
Brooklyn, NY 11211
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.