2020
DOI: 10.48550/arxiv.2008.11421
Preprint

Scaling Distributed Deep Learning Workloads beyond the Memory Capacity with KARMA

Abstract: The dedicated memory of hardware accelerators can be insufficient to store all weights and/or intermediate states of large deep learning models. Although model parallelism is a viable approach to reduce the memory pressure issue, significant modification of the source code and considerations for algorithms are required. An alternative solution is to use out-of-core methods instead of, or in addition to, data parallelism. We propose a performance model based on the concurrency analysis of out-of-core training be…
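One family of out-of-core methods keeps activations in host memory while the accelerator works on other layers. Below is a minimal illustrative sketch (not the paper's KARMA implementation) that uses PyTorch's saved_tensors_hooks to offload saved activations to the CPU during the forward pass and restore them for the backward pass; the toy model and tensor sizes are assumptions.

import torch
import torch.nn as nn

# Illustrative assumption: whatever accelerator is available, else fall back to CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

def pack_to_cpu(tensor):
    # Move each activation that autograd wants to save off the accelerator.
    return tensor.to("cpu", non_blocking=True)

def unpack_from_cpu(tensor):
    # Bring the activation back to the compute device when backward needs it.
    return tensor.to(device, non_blocking=True)

# Toy model and batch; sizes are arbitrary assumptions for the sketch.
model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 10)).to(device)
x = torch.randn(32, 1024, device=device)

with torch.autograd.graph.saved_tensors_hooks(pack_to_cpu, unpack_from_cpu):
    loss = model(x).sum()   # saved activations live in host memory from here on
loss.backward()             # activations are copied back on demand during backward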

Cited by 3 publications (6 citation statements). References 28 publications.
“…Reduce required memory by using lower precision (quantization) [43], gradient checkpointing [6], out-of-core methods [49],…”
Section: Notation
confidence: 99%
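Gradient checkpointing (rematerialization), one of the techniques listed in the statement above, drops selected activations during the forward pass and recomputes them during backward, trading compute for memory. Below is a minimal sketch using PyTorch's built-in checkpoint utility; the toy block and tensor sizes are assumptions.

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# Toy module whose intermediate activations we choose not to store.
block = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))
head = nn.Linear(512, 10)

x = torch.randn(16, 512)
# Activations inside `block` are discarded after the forward pass and
# recomputed when backward reaches this segment.
h = checkpoint(block, x, use_reentrant=False)
loss = head(h).sum()
loss.backward()   # parameter gradients are still produced despite the dropped activations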
“…Larger DL models, those using the ADAM optimizer in particular, may need a higher computation time for the weight update. Notably, large Transformer-based models report spending up to 45% of their time on the weight update and more than 60% extra memory, since ADAM requires four variables per weight [49]. One alternative to address this is to shard the weight update among GPUs across iterations, and Allgather the weights before the forward/backward passes [52].…”
Section: 3.3
confidence: 99%
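The sharded weight update with a subsequent Allgather mentioned above can be sketched in a single process by simulating the ranks. This is an illustrative ZeRO-style sketch, not the scheme of the cited papers; the world size, parameter count, and hyperparameters are assumptions, and no real communication takes place.

import numpy as np

world_size = 4                                        # number of simulated ranks (assumption)
params = np.random.randn(1024).astype(np.float32)     # full parameter vector (toy size)
grads = np.random.randn(1024).astype(np.float32)      # gradients, assumed already reduced

# Each rank owns one shard of the parameters and the matching Adam moments (m, v),
# so the four-variables-per-weight cost is split across ranks.
shards = np.array_split(params, world_size)
grad_shards = np.array_split(grads, world_size)
m = [np.zeros_like(s) for s in shards]                # first moment, one shard per rank
v = [np.zeros_like(s) for s in shards]                # second moment, one shard per rank

lr, b1, b2, eps, t = 1e-3, 0.9, 0.999, 1e-8, 1
for r in range(world_size):                           # each simulated rank updates only its shard
    m[r] = b1 * m[r] + (1 - b1) * grad_shards[r]
    v[r] = b2 * v[r] + (1 - b2) * grad_shards[r] ** 2
    m_hat = m[r] / (1 - b1 ** t)
    v_hat = v[r] / (1 - b2 ** t)
    shards[r] = shards[r] - lr * m_hat / (np.sqrt(v_hat) + eps)

# "Allgather": every rank reassembles the full, updated weights before the next forward pass.
params = np.concatenate(shards)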
“…In this scenario, our work combines tensor partition and rematerialization for memory management on DCG. Although these two optimization techniques can both reduce the memory footprint, they were developed separately [28,29] based on different application requirements. In fact, tensor partition was designed for training a large model on multiple machines, aiming at load balance to maximize parallelism and at locality to minimize network communication.…”
Section: Introduction
confidence: 99%
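Tensor partition can be illustrated for a single linear layer by splitting its weight matrix column-wise across two simulated devices; the sizes are assumptions, and the final concatenation stands in for the collective that real multi-machine training would perform over the network.

import numpy as np

x = np.random.randn(8, 256).astype(np.float32)        # activations, replicated on both devices
W = np.random.randn(256, 512).astype(np.float32)      # full weight matrix, kept only for reference

W0, W1 = np.split(W, 2, axis=1)                        # column partition: one half per device
y0 = x @ W0                                            # computed on simulated "device 0"
y1 = x @ W1                                            # computed on simulated "device 1"
y = np.concatenate([y0, y1], axis=1)                   # gather the partial outputs

# Each output column is the same dot product as in the unpartitioned layer,
# so the gathered result matches the reference computation.
assert np.allclose(y, x @ W, atol=1e-4)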
“…However, three complications make the combination non-trivial. (1) Since recent tensor partition plans [15,19,23,28,29,33] are usually defined before execution and fixed during training, these plans do not take into account the runtime information that essentially determines the tensor rematerialization. Hence, the first issue is how to achieve dynamic adjustment of tensor partition at runtime.…”
Section: Introduction
confidence: 99%
“…Reduce required memory by using lower precision (quantization) [71], gradient checkpointing [89], out-of-core methods [90],…”
confidence: 99%