Multi-stage user-facing applications on GPUs are now widely used and are often implemented as microservices. Prior work is not applicable to ensuring the QoS of GPU-based microservices because of their different communication patterns and shared resource contention. We propose Astraea to manage GPU microservices while accounting for these factors. In Astraea, a microservice deployment policy maximizes the supported peak service load while ensuring the required QoS. To adaptively switch communication methods between microservices under different deployments, we propose an auto-scaling GPU communication framework. The framework scales automatically based on the current hardware topology and microservice placement, and adopts global memory-based techniques to reduce intra-GPU communication. Astraea increases the supported peak load by up to 82.3% while meeting the desired 99%-ile latency target compared with state-of-the-art solutions.
CCS CONCEPTS
• Computer systems organization → Cloud computing; Neural networks; • Networks → Cloud computing.