With modern advancements in deep learning architectures and abundant research in areas such as computer vision, natural language processing, and forecasting, models are becoming more complex and datasets are growing exponentially in size, demanding high-performance, faster computing machines from researchers and engineers. TensorFlow addresses this need with a range of high-level APIs for distributed deep learning that can scale training from one machine to many. In this paper, we investigate the performance of computing clusters built with these APIs. We create clusters of different sizes and discuss the performance issues of distributed deep learning under high-latency and poor communication conditions. To address the challenge of finding the optimal cluster for fast distributed deep learning, we propose a recommendation system that suggests the cluster size with the fastest training time for a given batch size and network latency. Our results show that, for certain algorithms, a two-machine cluster is both faster and cheaper than a four-machine cluster when network delay is high.
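For context on the APIs the abstract refers to: TensorFlow's multi-worker distribution strategies (e.g. `tf.distribute.MultiWorkerMirroredStrategy`) discover the cluster through the `TF_CONFIG` environment variable, which each machine sets before training starts. A minimal sketch for one worker in a two-machine cluster follows; the host addresses are placeholders, not values from the paper:

```json
{
  "cluster": {
    "worker": ["10.0.0.1:12345", "10.0.0.2:12345"]
  },
  "task": {"type": "worker", "index": 0}
}
```

The second machine would use the same `cluster` block with `"index": 1`; each worker then constructs the strategy and runs the identical training script, synchronizing gradients over the network, which is why the latency conditions studied in the paper directly affect training time.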