Aggressive Synchronization with Partial Processing for Iterative ML Jobs on Clusters

Wang, Shaoqi; Chen, Wei; Pi, Aidi; Zhou, Xiaobo

doi:10.1145/3274808.3274828

Cited by 15 publications (6 citation statements)

References 26 publications (24 reference statements)


“…While GeePS supports synchronous, bounded asynchronous and asynchronous parameter synchronization, it is designed to minimize the straggler problem on GPUs, and hence, achieves best convergence speed when using the synchronous approach. Wang et al [184] propose an aggressive synchronization scheme that is based on BSP, named A-BSP. Different from BSP, A-BSP allows the fastest task to fetch current updates generated by the other (straggler) tasks that have only partially processed their input data.…”

Section: Synchronization (mentioning)

confidence: 99%

Scalable Deep Learning on Distributed Infrastructures

2020


Deep Learning (DL) has had immense success in the recent past, leading to state-of-the-art results in various domains such as image recognition and natural language processing. One of the reasons for this success is the increasing size of DL models and the availability of vast amounts of training data. To keep improving the performance of DL, the scalability of DL systems must increase. In this survey, we perform a broad and thorough investigation of challenges, techniques and tools for scalable DL on distributed infrastructures. This incorporates infrastructures for DL, methods for parallel DL training, multi-tenant resource scheduling and the management of training and model data. Further, we analyze and compare 11 current open-source DL frameworks and tools and investigate which of the techniques are commonly implemented in practice. Finally, we highlight future research trends in DL systems that deserve further research.

One of the driving factors of the success of DL is the scale of training in three dimensions. The first dimension of scale is the size and complexity of the models themselves. Starting from simple, shallow neural networks, with increasing depth and more sophisticated model architectures, new breakthroughs in model accuracy were achieved [30,38]. The second dimension of scale is the amount of training data. The model accuracy can, to a large extent, be improved by feeding more training data into the model [56,63]. In practice, it is reported that tens to hundreds of terabytes (TB) of training data are used in the training of a DL model [27,62]. The third dimension is the scale of the infrastructure. The availability of programmable, highly parallel hardware, especially graphics processing units (GPUs), is a key enabler for training large models with a lot of training data in a short time [30,206].

Our survey is focused on challenges that arise when managing a large, distributed infrastructure for DL. Hosting a large number of DL models that are trained with large amounts of training data is challenging. This includes questions of parallelization, resource scheduling and elasticity, data management and portability. This field is now in rapid development, with contributions from diverse research communities such as distributed and networked systems, data management, and machine learning. At the same time, we see a number of open-source DL frameworks and orchestration systems emerging [4,24,141,195]. In this survey, we bring together, classify and compare the huge body of work on distributed infrastructures for DL from the different communities that contribute to this area. Furthermore, we provide an overview and comparison of the existing open-source DL frameworks and tools that put distributed DL into practice. Finally, we highlight and discuss open research challenges in this field.

Complementary Surveys

There are a number of surveys on DL that are complementary to ours. Deng [41] provides a general survey on DL architectures, algorithms and applications. LeCun et al. pro...
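The A-BSP behavior described in the citation statement for this entry (the fastest task synchronizes using updates from straggler tasks that have only partially processed their input) can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the implementation of Wang et al.: the 1-D linear model, the per-task `speeds`, the names `processed_counts` and `absp_step`, and the choice to normalize the aggregated gradient by the number of examples actually processed are all hypothetical.

```python
from typing import List, Sequence, Tuple

Example = Tuple[float, float]  # (x, y) pair for a 1-D linear model y ~ w * x


def partial_gradient(w: float, examples: Sequence[Example]) -> float:
    """Sum of the gradients d/dw (w*x - y)^2 over the given examples."""
    return sum(2.0 * (w * x - y) * x for x, y in examples)


def processed_counts(shard_sizes: Sequence[int], speeds: Sequence[float]) -> List[int]:
    """Examples each task has processed at the moment the first task finishes.

    Assumption: task i processes speeds[i] examples per unit of time.
    """
    finish_time = min(size / speed for size, speed in zip(shard_sizes, speeds))
    return [
        min(size, int(speed * finish_time + 1e-9))
        for size, speed in zip(shard_sizes, speeds)
    ]


def absp_step(
    w: float,
    shards: Sequence[Sequence[Example]],
    speeds: Sequence[float],
    lr: float,
) -> Tuple[float, List[int]]:
    """One synchronization round that does not wait for stragglers.

    Plain BSP would wait until every task has processed its whole shard.
    Here the round ends when the fastest task is done, and each straggler
    contributes the gradient of the prefix of its shard processed so far.
    """
    counts = processed_counts([len(s) for s in shards], speeds)
    total = sum(counts)  # > 0 because the fastest task finished a non-empty shard
    grad = sum(partial_gradient(w, shard[:n]) for shard, n in zip(shards, counts))
    return w - lr * grad / total, counts


if __name__ == "__main__":
    # Three tasks with identical shards drawn from y = 3x, but different speeds.
    shards = [[(1.0, 3.0), (2.0, 6.0), (1.0, 3.0), (2.0, 6.0)] for _ in range(3)]
    w = 0.0
    for _ in range(100):
        w, counts = absp_step(w, shards, speeds=[2.0, 1.0, 0.5], lr=0.1)
    print(counts, round(w, 6))
```

In this toy setting the stragglers contribute two and one examples per round instead of four, yet the model still converges, because every round uses whatever work has been completed rather than idling at a barrier.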


“…SSP [12][61] enables processes to execute the training independently and allows fast workers to advance a bounded number of iterations ahead of slow workers. A-BSP [59] is proposed to aggressively synchronize parameters by applying the partial updates from slower workers. But all these approaches target on the centralized PS architecture.…”

Section: Related Work (mentioning)

confidence: 99%

Mitigating Stragglers in the Decentralized Training on Heterogeneous Clusters

Yang, Rang, Cheng

2020

Proceedings of the 21st International Middleware Conference
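The SSP rule quoted in this entry (fast workers may run a bounded number of iterations ahead of slow workers) amounts to a simple check on per-worker iteration counters. The following is a minimal sketch under assumptions: the class name `SspClock`, its methods, and the convention that the bound applies to completed iterations are hypothetical and not taken from any particular parameter-server implementation.

```python
from typing import Dict, Iterable


class SspClock:
    """Tracks completed iterations per worker and enforces a staleness bound."""

    def __init__(self, workers: Iterable[str], staleness: int) -> None:
        if staleness < 1:
            raise ValueError("this sketch assumes a staleness bound of at least 1")
        self.staleness = staleness
        self.clock: Dict[str, int] = {w: 0 for w in workers}

    def can_advance(self, worker: str) -> bool:
        """True if completing one more iteration keeps `worker` within the bound."""
        slowest = min(self.clock.values())
        return self.clock[worker] - slowest < self.staleness

    def advance(self, worker: str) -> bool:
        """Record one completed iteration, or return False if the worker must wait."""
        if not self.can_advance(worker):
            return False
        self.clock[worker] += 1
        return True


if __name__ == "__main__":
    clock = SspClock(["fast", "slow"], staleness=2)
    print(clock.advance("fast"))  # fast completes iteration 1
    print(clock.advance("fast"))  # fast completes iteration 2
    print(clock.advance("fast"))  # blocked: already 2 ahead of slow
    print(clock.advance("slow"))  # slow completes iteration 1
    print(clock.advance("fast"))  # fast may proceed again
```

A real system would block or sleep where this sketch returns `False`; returning a flag keeps the example single-threaded and easy to trace.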


“…Recent efforts propose alternative synchronization models to mitigate the skewness. A-BSP [23] is a BSP-based aggressive synchronization model that uses updates from partial input data for synchronization. SSP [3], [24] uses flexible synchronization and allows any worker to be up to a bounded number of iterations ahead of the slowest worker.…”

Section: Related Work (mentioning)

confidence: 99%

Addressing Skewness in Iterative ML Jobs with Parameter Partition

Wang, Chen, Zhou, et al.

2019

IEEE INFOCOM 2019 - IEEE Conference on Computer Communications

Self Cite


Computational skewness is a significant challenge in multi-tenant data-parallel clusters that introduce dynamic heterogeneity of machine capacity in distributed data processing. Previous efforts to address skewness mostly focus on batch jobs, based on the assumption that processing time is linearly dependent on the size of partitioned data. However, they are ill-suited for iterative machine learning (ML) jobs, which (1) exhibit a non-linear relationship between the size of partitioned parameters and processing time within each iteration, and (2) show an explicit binding relationship between input data and parameters for parameter update. In this paper, we present FlexPara, a parameter partition approach that leverages the non-linear relationship and provisions adaptive tasks to match the distinct machine capacity so as to address the skewness in iterative ML jobs on data-parallel clusters. FlexPara first predicts task processing time based on a capacity model designed for iterative ML jobs without the linear assumption. It then partitions parameters to parallel tasks through proactive parameter reassignment. Such reassignment can significantly reduce the network transmission cost incurred by input data movement due to the binding relationship. We implement FlexPara in Spark and evaluate it with various ML jobs. Experimental results show that, compared to hash partition, FlexPara speeds up the execution by up to 54% and 43% in private and NSF Chameleon clusters, respectively.
