2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA)
DOI: 10.1109/hpca51647.2021.00040
QEI: Query Acceleration Can be Generic and Efficient in the Cloud

Cited by 5 publications (3 citation statements) · References 51 publications
“…These works clearly show that RoCE reduces the CPU load (related to network communication) significantly compared to TCP and, at the same time, offers lower communication latency. RoCE communication has been traditionally established over RDMA-capable NICs, i.e., the main focus has been to build such hardware, e.g., [28]. Although Soft-RoCE has been developed to enable hardware-independent RDMA communication, it has not received much attention for industrial use.…”
Section: Distributed Automotive Application
confidence: 99%
“…First, a (de)serializer can be optionally used, if the application uses an RPC protocol for inter-machine communications [90]. Then, to process requests, we typically need a data structure walker [52,86,105,173,176] to find the location of the target data of the request. To maximize the memory-level parallelism and hide the memory access latency, multiple outstanding requests and out-of-order execution should be supported.…”
Section: Orca CC-accelerator Architecture
confidence: 99%
“…In CPU and Smart NIC, batching means processing requests in a batch to improve the memory access efficiency [99]. In ORCA, since the APU can already exploit the memory-level parallelism across requests [86,105,173,176], there is no need for request batching. Hence, we batch the doorbell signals to the RNIC [77] when posting RDMA operations for response.…”
Section: B. In-memory Key-value Store
confidence: 99%