2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA)
DOI: 10.1109/isca45697.2020.00081
A Multi-Neural Network Acceleration Architecture

Cited by 74 publications (53 citation statements)
References 43 publications
“…So, to fully utilize the hardware and increase energy efficiency, a growing number of deep learning service providers are introducing task-level parallelism by sharing one computing device among multiple customers/requests, known as multi-tenant deep learning service, through either temporal multiplexing (e.g., PREMA [12], AI-MT [2]) or spatial multiplexing (e.g., NVIDIA Multi-Process Service [44], NVIDIA Ampere Multi-Instance GPU [47]).…”
Section: DNN Execution Characterization on CPU
confidence: 99%
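To make the temporal-multiplexing idea in this citation concrete, here is a minimal Python sketch of round-robin time slicing of a single accelerator across tenants. It is an illustration only, not PREMA's or AI-MT's actual policy; the class name, layer latencies, and time quantum are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class InferenceRequest:
    tenant: str
    layers: deque  # remaining per-layer latencies in ms (hypothetical profile)

def temporal_multiplex(requests, quantum_ms=2.0):
    """Round-robin time slicing of one accelerator across tenants.

    On each turn a request executes whole layers until its time quantum
    runs out, then yields the device to the next tenant in the queue.
    """
    ready = deque(requests)
    clock = 0.0
    while ready:
        req = ready.popleft()
        budget = quantum_ms
        while req.layers and budget > 0:
            latency = req.layers.popleft()  # run one layer to completion
            clock += latency
            budget -= latency
        if req.layers:
            ready.append(req)               # quantum spent: preempt and re-queue
        else:
            print(f"{req.tenant} finished at t={clock:.1f} ms")

temporal_multiplex([
    InferenceRequest("tenant-A", deque([1.5, 0.8, 2.1])),
    InferenceRequest("tenant-B", deque([0.5, 0.5, 0.5])),
])
```

Spatial multiplexing (MPS, MIG) would instead partition the device's resources so both tenants run concurrently, trading per-request latency for isolation.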
“…Our main finding is two-fold. First, a fixed scheduling granularity, such as the entire model [12] or the sub-layer block [2,21], leads to sub-optimal performance, owing to the diversity of DNN models and their distinctive inner characteristics. Second, the performance of existing compilation strategies, which aim to maximize code performance in the solo-run case, degrades significantly when multiple DNN models run together and interfere with each other.…”
Section: Optimization Space Analysis
confidence: 99%
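The finding about fixed granularity can be illustrated with a small heuristic: decide per model whether to schedule it as a whole or split it into sub-layer blocks, based on how skewed its layer costs are. The threshold and latency profiles below are assumed for illustration and are not taken from the cited work.

```python
def pick_granularity(layer_costs_ms, skew_threshold=3.0):
    """Pick a scheduling granularity for one model.

    layer_costs_ms: per-layer latency profile (illustrative numbers).
    skew_threshold: max/min cost ratio above which the model is split
                    into sub-layer blocks (assumed value).
    """
    skew = max(layer_costs_ms) / max(min(layer_costs_ms), 1e-9)
    return "block" if skew > skew_threshold else "model"

print(pick_granularity([1.0, 1.1, 0.9]))       # uniform layers -> 'model'
print(pick_granularity([0.2, 5.0, 0.3, 4.8]))  # skewed layers  -> 'block'
```

A model with uniform layer costs gains little from splitting, while a skewed model leaves gaps that a co-running model can fill only at block granularity.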
“…Due to the wide spectrum of NPUs and DNN models, it is nearly impossible to balance the usage of NPU resources for all DNNs. Recently, several proposals have addressed this problem by time-multiplexing the layer-wise execution of multiple DNN models with opposing characteristics (e.g., memory-intensive and compute-intensive) to saturate both compute and memory bandwidth [5], [12]. Figure 1 […] individual requests.…”
Section: Limitations of the Prior Art
confidence: 99%
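A minimal sketch of the layer-wise pairing idea, assuming hypothetical layer tags: interleave a compute-bound model's layers with a memory-bound model's layers so each scheduling slot exercises both the compute units and the memory bandwidth. This is an illustration of the technique the citation describes, not AI-MT's actual scheduler.

```python
from itertools import zip_longest

def interleave(compute_layers, memory_layers):
    """Yield (compute_layer, memory_layer) pairs to co-issue each slot."""
    for pair in zip_longest(compute_layers, memory_layers):
        yield pair  # one side saturates the MAC array, the other DRAM bandwidth

conv_model = ["conv1", "conv2", "conv3"]  # compute-intensive (hypothetical tags)
embed_model = ["embed_lookup", "gather"]  # memory-intensive (hypothetical tags)
for c, m in interleave(conv_model, embed_model):
    print("co-issue:", c, m)  # a leftover slot pairs with None
```

When the two models' layer counts differ, the leftover layers run alone, which is exactly the underutilization the layer-wise multiplexing proposals try to minimize.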