2020
DOI: 10.1007/s11227-020-03200-6
Training deep neural networks: a static load balancing approach

Cited by 13 publications (8 citation statements)
References 12 publications
“…These techniques have been extrapolated to develop similar versions for distributed neural networks. In this context, computation awareness has been improved for data parallelism [4] and model parallelism [29]. As a consequence, partitioning the workload between resources from different nodes could produce communication bottlenecks that should be handled.…”
Section: Related Work
confidence: 99%
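As a rough illustration of the data-parallel pattern mentioned in the excerpt above, and of the gradient exchange that can become a communication bottleneck when workers sit on different nodes, here is a minimal sketch of a synchronous data-parallel update. The linear model, worker count, and helper names are illustrative assumptions, not code from the cited works.

```python
# Illustrative sketch of synchronous data parallelism (not the cited papers' code).
# Each worker computes gradients on its shard of the batch; the gradients are then
# averaged, which is the all-reduce-style exchange that can become a communication
# bottleneck across nodes.
import numpy as np

def local_gradient(weights, x_shard, y_shard):
    # Least-squares gradient for a linear model, used only to keep the example small.
    pred = x_shard @ weights
    return x_shard.T @ (pred - y_shard) / len(x_shard)

def data_parallel_step(weights, x_batch, y_batch, num_workers, lr=0.01):
    # 1. Partition the batch across workers (equal shards, i.e. no load balancing yet).
    x_shards = np.array_split(x_batch, num_workers)
    y_shards = np.array_split(y_batch, num_workers)
    # 2. Each worker computes a local gradient on its own replica of the model.
    grads = [local_gradient(weights, xs, ys) for xs, ys in zip(x_shards, y_shards)]
    # 3. Average the gradients (the communication step) and update the shared model.
    return weights - lr * np.mean(grads, axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = np.zeros(8)
    x, y = rng.normal(size=(256, 8)), rng.normal(size=256)
    w = data_parallel_step(w, x, y, num_workers=4)
```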
“…However, in the case of GPU workers, batch processing time and batch size are not proportional, so the batch size is determined by numerical approximation using a static step size to adjust the batch size. A study that allocates a batch size proportional to the performance of each computing node using a static load-balancing technique has been presented [32]. BOA (Batch Orchestration Algorithm) adaptively adjusts the batch size according to the worker's speed to alleviate both static and dynamic stragglers [10].…”
Section: Straggler Mitigation
confidence: 99%
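To make the proportional-allocation idea concrete, the sketch below sizes each worker's batch in proportion to its measured throughput, in the spirit of a static load-balancing scheme; the throughput figures and function name are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of static load balancing for synchronous training:
# each worker receives a share of the global batch proportional to its
# measured throughput (samples/second), so all workers finish an iteration
# at roughly the same time. Throughput values are illustrative.
def proportional_batch_sizes(global_batch, throughputs):
    total = sum(throughputs)
    # Initial proportional split, rounded down to integers.
    sizes = [int(global_batch * t / total) for t in throughputs]
    # Hand out any leftover samples to the fastest workers first.
    leftover = global_batch - sum(sizes)
    for i in sorted(range(len(sizes)), key=lambda i: -throughputs[i])[:leftover]:
        sizes[i] += 1
    return sizes

if __name__ == "__main__":
    # e.g. one fast GPU node and two slower nodes profiled offline.
    print(proportional_batch_sizes(512, [900.0, 450.0, 300.0]))
    # -> [280, 139, 93]
```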
“…In data parallelism, nodes need to maintain the entire model parameters, and large convolutional networks cannot be loaded in edge devices. In addition, there is a synchronization waiting problem during model training [10]. Model parallelism [11,12] splits the tensor of a specific layer so that multiple computing nodes or processes work on that layer simultaneously, dividing a large network layer into multiple relatively small tensor-parallel computations. This approach does not require loading the entire model into edge nodes, which facilitates the training of larger models.…”
Section: Related Work
confidence: 99%
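As a simplified picture of the intra-layer (tensor) model parallelism the excerpt describes, the sketch below splits one dense layer's weight matrix column-wise across workers, so no single worker holds the whole layer; worker count, shapes, and function names are illustrative assumptions rather than the cited papers' implementations.

```python
# Simplified sketch of intra-layer (tensor) model parallelism:
# the weight matrix of a single dense layer is split column-wise across
# workers, each worker computes its slice of the output, and the slices
# are concatenated. No single worker needs to store the full layer.
import numpy as np

def split_layer(weight, num_workers):
    # Partition the layer's columns (output units) across workers.
    return np.array_split(weight, num_workers, axis=1)

def parallel_layer_forward(x, weight_shards):
    # Each worker applies its shard; the partial outputs are gathered
    # (a communication step) and concatenated into the full layer output.
    partial_outputs = [x @ w_shard for w_shard in weight_shards]
    return np.concatenate(partial_outputs, axis=1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(32, 256))            # a batch of activations
    full_weight = rng.normal(size=(256, 1024))
    shards = split_layer(full_weight, num_workers=4)
    y = parallel_layer_forward(x, shards)
    assert np.allclose(y, x @ full_weight)    # matches the unsplit layer
```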