2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)
DOI: 10.1109/micro.2016.7783721

vDNN: Virtualized deep neural networks for scalable, memory-efficient neural network design

Abstract: The most widely used machine learning frameworks require users to carefully tune their memory usage so that the deep neural network (DNN) fits into the DRAM capacity of a GPU. This restriction hampers a researcher's flexibility to study different machine learning algorithms, forcing them to either use a less desirable network architecture or parallelize the processing across multiple GPUs. We propose a runtime memory manager that virtualizes the memory usage of DNNs such that both GPU and CPU memory can simultaneously…
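The mechanism the abstract describes, keeping a layer's data in CPU memory while the GPU works on other layers, can be sketched in a few lines. The following is a minimal sketch assuming PyTorch with a CUDA device; it is not the paper's vDNN implementation, and the names `offload_stream`, `offload`, and `prefetch` are mine.

```python
import torch

# Illustrative sketch only, not the paper's vDNN implementation: offload a
# layer's feature map to pinned host (CPU) memory on a side stream once the
# forward pass no longer needs it, then prefetch it back before the backward
# pass reuses it.

offload_stream = torch.cuda.Stream()  # copies on this stream overlap compute

def offload(feature_map: torch.Tensor) -> torch.Tensor:
    """Asynchronously copy a GPU tensor into pinned CPU memory."""
    host_buf = torch.empty_like(feature_map, device="cpu").pin_memory()
    with torch.cuda.stream(offload_stream):
        host_buf.copy_(feature_map, non_blocking=True)
    return host_buf

def prefetch(host_buf: torch.Tensor) -> torch.Tensor:
    """Bring an offloaded tensor back to the GPU ahead of its backward use."""
    with torch.cuda.stream(offload_stream):
        return host_buf.to("cuda", non_blocking=True)
```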

Cited by 275 publications (212 citation statements)
References 28 publications (49 reference statements)
“…As such, the (GBs of) NPU local memory will be large enough to preserve tens of preempted tasks' context state. If the multiple checkpointed states oversubscribe NPU memory, the approach taken by Rhu et al. [39] can similarly be employed to handle memory oversubscription via copying overflowing data to the CPU memory. Concretely, when the runtime observes that NPU memory usage is nearing its limit, the DMA unit can proactively migrate some of the checkpointed state from NPU to CPU memory while the inference request is being serviced to hide migration overhead.…”
Section: G. Storage Overhead of Preemption (mentioning)
confidence: 99%
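As a rough illustration of the watermark policy this statement describes, the sketch below drains checkpointed buffers to host memory once device usage crosses a threshold. It is a hypothetical sketch: `HIGH_WATERMARK`, the checkpoint objects, and `migrate_to_host_async` are invented for illustration and do not come from the cited papers.

```python
HIGH_WATERMARK = 0.90  # hypothetical fraction of device memory that triggers migration

def maybe_migrate(device_used: int, device_total: int, checkpoints: list) -> int:
    """Proactively evict checkpointed state to host memory near the limit.

    `checkpoints` holds objects with an `nbytes` size and an (invented)
    `migrate_to_host_async()` method standing in for a DMA copy that the
    runtime overlaps with ongoing inference work.
    """
    while checkpoints and device_used / device_total > HIGH_WATERMARK:
        buf = checkpoints.pop(0)       # oldest checkpoint first (FIFO)
        buf.migrate_to_host_async()    # DMA to CPU memory, overlapped with compute
        device_used -= buf.nbytes
    return device_used
```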
“…Memory-overlaying for DNN Virtual Memory. We implemented the runtime memory management policy as described in [9], [30], [10], [52], which leverages the network DAG to analyze inter-layer data dependencies and schedule memory-overlaying operations for virtual memory. Under our implementation, the device memory is utilized as an application-level cache with respect to the host memory.…”
Section: Methods (mentioning)
confidence: 99%
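A minimal sketch of the scheduling idea in this quote, assuming the simplest possible DAG (a linear chain of layers); the function and field names are mine, not the implementation from [9], [30], [10], [52].

```python
def schedule_overlays(num_layers: int) -> dict[int, dict[str, int]]:
    """Derive per-layer offload/prefetch points from a linear-chain DAG."""
    schedule = {}
    for i in range(num_layers - 1):
        schedule[i] = {
            # layer i's output is last read (forward) by layer i + 1, so it
            # can be evicted to host memory once that layer finishes
            "offload_after_fwd_of": i + 1,
            # backward visits layers in reverse order, so start refilling
            # the device copy while layer i + 1's backward is still running
            "prefetch_during_bwd_of": i + 1,
        }
    return schedule
```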
“…memory usage of DNNs [9], [10], [13], [14], [15], [16] have proposed to utilize both host and device memory concurrently for allocating data structures for DNN training. By leveraging the user-level DNN topology graph as a means to extract compile-time data dependency information (encapsulated as a directed acyclic graph (DAG) data structure) for the memory-hungry data structures, e.g., feature maps (X) and/or weights (W), DNN virtual memory can use this dependency information to derive the DNN data reuse distance and schedule performance-aware data copy operations, memory-overlaying across host and device memory via PCIe [27], [28], [29].…”
Section: B. Virtualizing Memory for Deep Learning (mentioning)
confidence: 99%
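The reuse-distance derivation this quote mentions can be sketched as a single pass over a topologically ordered access trace; the sketch below is my own construction for illustration, not code from the cited works. A large gap between consecutive uses of a tensor means its host/device copy can be hidden behind the intervening layers' compute.

```python
def reuse_distances(trace: list[list[str]]) -> dict[str, int]:
    """trace[step] lists the tensor names accessed at execution step `step`.

    Returns the smallest gap (in steps) between consecutive uses of each
    tensor, e.g. reuse_distances([["x0"], ["x0", "x1"], ["x1"], ["x0"]])
    yields {"x0": 1, "x1": 1}.
    """
    last_seen: dict[str, int] = {}
    dist: dict[str, int] = {}
    for step, tensors in enumerate(trace):
        for name in tensors:
            if name in last_seen:
                gap = step - last_seen[name]
                dist[name] = min(dist.get(name, gap), gap)
            last_seen[name] = step
    return dist
```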
“…The first approach involves using the data-swapping method which is proposed in this paper. M. N. Rhu et al. [7] and Meng et al. [8] also used this approach. They used popular neural networks such as ResNet-50 for evaluation and focused primarily on the increase in batch size.…”
Section: Related Work (mentioning)
confidence: 99%