Gradient-based distributed learning in Parameter Server (PS) computing architectures is subject to random delays due to straggling worker nodes, as well as to possible communication bottlenecks between the PS and the workers. Solutions have recently been proposed to separately address these impairments based on the ideas of gradient coding, worker grouping, and adaptive worker selection. This paper provides a unified analysis of these techniques in terms of wall-clock time, communication, and computation complexity measures. Furthermore, in order to combine the benefits of gradient coding and grouping in terms of robustness to stragglers with the communication and computation load gains of adaptive selection, novel strategies, named Lazily Aggregated Gradient Coding (LAGC) and Grouped-LAG (G-LAG), are introduced. Analysis and results show that G-LAG provides the best wall-clock time and communication performance, while maintaining a low computational cost, for two representative distributions of the computing times of the worker nodes.

Gradient-based distributed learning in a PS architecture is subject to two key impairments. First, random computing times at the workers can cause significant slowdowns in the wall-clock run-time per iteration due to straggling workers [6]. Second, the communication overhead resulting from intensive two-way communications between the PS and the workers may require significant networking resources to be available if it is not to dominate the overall run-time [7].

Recently, solutions have been developed that aim either at improving robustness to stragglers, namely Gradient Coding (GC) and grouping [8], [9], or at reducing the communication load, namely adaptive selection [10] (see Table I for a summary).

TABLE I: Qualitative comparisons with respect to standard (distributed) Gradient Descent (GD)

                            Coding    Grouping    Adaptive selection
  Robustness to stragglers  better    better      same
  Communication load        same      same        better
  Computation load          worse     worse       better

GC, introduced in [8], increases robustness to stragglers by leveraging storage and computation redundancy at the worker nodes as compared to standard (distributed) Gradient Descent (GD) [11]. With a redundancy factor r > 1, each worker stores, and computes on, r times more data than with GD. In return, up to r − 1 stragglers can be tolerated, while still allowing the PS to exactly compute the gradient at every iteration. GC requires coding the computed gradients prior to communication from the workers to the PS, as well as decoding at the PS (see the first sketch at the end of this section).

As a special case of GC, when the redundancy factor r equals the number M of workers, each worker can store the entire dataset, and the gradient can hence be obtained from any single worker without requiring any coding or decoding operation. In the typical case in which r is smaller than M, the same simple procedure can be applied to groups of workers. In particular, given a redundancy factor r, the dataset can be partitioned so that each partition is available to all nodes of a group of r workers. The PS can then recover the gradient upon receiving the computation of any one worker from each group (see the second sketch at the end of this section). The outlined group...
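
To make the GC encoding and decoding concrete, the following is a minimal numerical sketch for M = 3 workers with redundancy r = 2, tolerating r − 1 = 1 straggler. The encoding matrix follows the well-known example construction of [8]; the function names, the toy gradient dimension, and the use of a least-squares solve at the PS are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Encoding matrix B: row i gives the linear combination of the k = 3
# partial gradients transmitted by worker i. Each row has r = 2
# non-zeros, so each worker stores and computes on 2 of the 3 partitions.
B = np.array([[0.5, 1.0,  0.0],   # worker 0 sends g1/2 + g2
              [0.0, 1.0, -1.0],   # worker 1 sends g2 - g3
              [0.5, 0.0,  1.0]])  # worker 2 sends g1/2 + g3

def worker_encode(i, partial_grads):
    """Coded message of worker i: a linear combination of partial gradients."""
    return B[i] @ partial_grads

def ps_decode(received):
    """Recover the full gradient from any M - (r - 1) = 2 workers.

    `received` maps worker index -> coded message. The PS finds
    coefficients a with B[idx].T @ a = [1, 1, 1], so that the same
    combination of the received messages equals g1 + g2 + g3.
    """
    idx = sorted(received)
    a, *_ = np.linalg.lstsq(B[idx].T, np.ones(B.shape[1]), rcond=None)
    return a @ np.stack([received[i] for i in idx])

# Toy partial gradients g1, g2, g3 for a model of dimension d = 4.
rng = np.random.default_rng(0)
g = rng.standard_normal((3, 4))

# Worker 1 straggles: the PS decodes from workers 0 and 2 alone.
messages = {i: worker_encode(i, g) for i in (0, 2)}
assert np.allclose(ps_decode(messages), g.sum(axis=0))
```

The same decoding succeeds for any pattern with at most one straggler, since every 2-row submatrix of B can combine to the all-ones vector; this is exactly the redundancy-for-robustness trade-off described above.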
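
The grouping special case admits an even simpler sketch, shown below for M = 6 workers and r = 2, i.e., three groups of two workers that each replicate one data partition. The exponential model for the computing times and all names are illustrative assumptions used only to show why no coding or decoding is needed.

```python
import numpy as np

M, r = 6, 2                 # 6 workers with redundancy r = 2 -> 3 groups
k = M // r                  # one data partition per group
rng = np.random.default_rng(1)

partial_grads = rng.standard_normal((k, 4))  # partial gradient of each partition
delay = rng.exponential(1.0, size=M)         # random computing time per worker

full_grad = np.zeros(4)
group_times = []
for g in range(k):
    members = range(g * r, (g + 1) * r)
    # All members of a group hold the same partition and would return the
    # same partial gradient, so the PS keeps only the fastest reply:
    # r - 1 stragglers per group are tolerated with no coding or decoding.
    group_times.append(min(delay[i] for i in members))
    full_grad += partial_grads[g]

wall_clock = max(group_times)  # the PS must hear from each group once
assert np.allclose(full_grad, partial_grads.sum(axis=0))
print(f"Per-iteration wall-clock time: {wall_clock:.3f}")
```

Note that the per-iteration wall-clock time is the maximum over groups of the minimum delay within each group, which illustrates how grouping trades storage redundancy for robustness to slow workers.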