2017
DOI: 10.1016/j.procs.2017.05.074

Distributed training strategies for a computer vision deep learning algorithm on a distributed GPU cluster

Abstract: Deep learning algorithms base their success on building high learning capacity models with millions of parameters that are tuned in a data-driven fashion. These models are trained by processing millions of examples, so that the development of more accurate algorithms is usually limited by the throughput of the computing devices on which they are trained. In this work, we explore how the training of a state-of-the-art neural network for computer vision can be parallelized on a distributed GPU cluster. The effec…
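The strategy the abstract refers to is data-parallel training across the GPUs of a cluster. The sketch below is a minimal, hypothetical illustration of that idea using PyTorch's DistributedDataParallel; it is not the paper's code, and the choice of model (ResNet-50), batch size, and launcher (torchrun) are assumptions made for the example.

```python
# Minimal data-parallel training sketch (an illustration, not the paper's code).
# Each process drives one GPU with a full model replica; DistributedDataParallel
# averages gradients across replicas on every backward pass, so all workers
# apply the same update. Assumes a single node launched with `torchrun`.
import os
import torch
import torch.distributed as dist
import torchvision.models as models
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")      # rank / world size come from the torchrun env
    local_rank = int(os.environ["LOCAL_RANK"])   # GPU index on this node
    torch.cuda.set_device(local_rank)
    device = f"cuda:{local_rank}"

    # ResNet-50 stands in for "a state-of-the-art neural network for computer vision".
    model = models.resnet50().to(device)
    model = DDP(model, device_ids=[local_rank])

    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    loss_fn = torch.nn.CrossEntropyLoss()

    # Dummy per-worker shard; a real run would use a DistributedSampler over the dataset.
    images = torch.randn(32, 3, 224, 224, device=device)
    labels = torch.randint(0, 1000, (32,), device=device)

    for _ in range(10):
        optimizer.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()   # gradients are all-reduced across GPUs here
        optimizer.step()  # every replica applies the same averaged update

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched as, for example, `torchrun --nproc_per_node=4 train_sketch.py`, the effective batch size becomes 4 x 32, which is why the learning-rate adjustments discussed in the citing works below matter.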

Cited by 37 publications (11 citation statements)
References 4 publications
Citation types: 0 supporting, 11 mentioning, 0 contrasting
Year published (citing statements): 2018, 2021
“…[26] It is obvious that having access to GPU clusters is a must to deploy deep networks in practice. [27][28][29] Pathology laboratories, however, are already under immense financial pressure to adopt WSI technology, and acquiring and storing gigapixel histopathological scans is a formidable challenge to the adoption of digital pathology. Asking for GPUs, as a prerequisite for training or using deep AI solutions, is consequently going to be financially limiting in the foreseeable future.…”
Section: Challenges (mentioning)
confidence: 99%
“…A learning rate value proportional to the batch size, warmup learning rate behaviour, batch normalization, and an SGD to RMSProp optimizer transition are some of the techniques presented in these works. A study of distributed training methods using the ResNet-50 architecture on an HPC cluster is shown in [10,11]. To know more about the algorithms used in this field we refer to [8].…”
Section: Related Work (mentioning)
confidence: 99%
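The excerpt above mentions scaling the learning rate with the batch size and warming it up at the start of training. A hedged sketch of both rules follows; the function names and default values are illustrative and are not taken from the cited works.

```python
# Illustrative helpers for two large-batch training heuristics mentioned above.

def scaled_lr(base_lr, base_batch, global_batch):
    """Linear scaling rule: multiplying the batch size by k multiplies the learning rate by k."""
    return base_lr * global_batch / base_batch

def warmup_lr(target_lr, epoch, warmup_epochs=5, start_lr=0.01):
    """Ramp the learning rate linearly from start_lr to target_lr over the warmup epochs."""
    if epoch >= warmup_epochs:
        return target_lr
    return start_lr + (target_lr - start_lr) * epoch / warmup_epochs

# Example: a schedule tuned for batch 256, now run on 8 workers with 256 images each.
target = scaled_lr(base_lr=0.1, base_batch=256, global_batch=8 * 256)  # -> 0.8
for epoch in range(8):
    print(epoch, round(warmup_lr(target, epoch), 3))
```

The warmup avoids applying the large scaled rate to a freshly initialized network, which is the failure mode these heuristics are meant to prevent.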
“…2) Distributed Deep Learning: In previous works [7], we explored how distributed learning can help to speed up training for neural networks. Several works on spatial, model and data parallelism have been done in recent years [8], [9], [10], including the implementation of these techniques into the most used deep learning frameworks, e.g.…”
Section: State of the Art (mentioning)
confidence: 99%
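For the data-parallel case mentioned in this excerpt, the synchronization step that framework implementations perform boils down to an all-reduce that averages gradients across workers. The sketch below shows that step done by hand with torch.distributed; it assumes a process group initialized as in the first sketch and is only an illustration of the mechanism, not any framework's internal code.

```python
import torch
import torch.distributed as dist

def average_gradients(model: torch.nn.Module) -> None:
    """Average gradients across all workers so every replica applies the same update."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)  # sum gradients over workers
            param.grad /= world_size                           # turn the sum into a mean
```

Called between `loss.backward()` and `optimizer.step()`, this reproduces manually what libraries such as DistributedDataParallel or Horovod do automatically, typically with better overlap of communication and computation.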