Multi-stage user-facing applications on GPUs are now widely used and are often implemented as microservices. Prior work is not applicable to ensuring the QoS of GPU-based microservices because of their different communication patterns and shared resource contention. We propose Astraea to manage GPU microservices while accounting for these factors. In Astraea, a microservice deployment policy maximizes the supported peak service load while ensuring the required QoS. To adaptively switch communication methods between microservices under different deployments, we propose an auto-scaling GPU communication framework. The framework scales automatically based on the current hardware topology and microservice placement, and adopts global memory-based techniques to reduce intra-GPU communication. Astraea increases the supported peak load by up to 82.3% while meeting the desired 99%-ile latency target compared with state-of-the-art solutions.
CCS CONCEPTS
• Computer systems organization → Cloud computing; Neural networks; • Networks → Cloud computing.