Randomness In Neural Network Training: Characterizing The Impact of Tooling
2021 · Preprint
DOI: 10.48550/arxiv.2106.11872

Cited by 8 publications (13 citation statements)
References 36 publications

“…Generally, machine learning experiments are not precisely predictable - complex models trained on complex data typically yield noisy or variable results [79,17]. Though individual experiments may be unpredictable, the general performance of large generative models tends to exhibit smooth and predictable growth as a function of scale - larger systems tend to do increasingly better on a broad range of tasks.…”
Section: Smooth General Capability Scaling (mentioning)
confidence: 99%
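
To make the quoted scaling point concrete, here is a minimal, illustrative sketch: it assumes a pure power-law relationship between model size and loss and fits it on made-up numbers. The functional form, sizes, and losses are assumptions for illustration, not data from the cited works.

```python
# Illustrative only: a pure power law loss ~ a * N^(-b) is assumed, and the
# (size, loss) pairs below are synthetic, not taken from any cited paper.
import numpy as np

rng = np.random.default_rng(0)
sizes = np.array([1e6, 1e7, 1e8, 1e9, 1e10])  # parameter counts
losses = 50.0 * sizes ** -0.095 * np.exp(rng.normal(0, 0.01, sizes.shape))

# In log-log space a power law is a straight line, so ordinary least squares
# recovers the exponent and prefactor.
slope, intercept = np.polyfit(np.log(sizes), np.log(losses), deg=1)
b, a = -slope, np.exp(intercept)
print(f"fitted loss ~ {a:.1f} * N^(-{b:.3f})")
print(f"extrapolated loss at 1e11 params: {a * 1e11 ** -b:.3f}")
```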
“…Reproducibility: Many factors contribute to irreproducibility in deep models [13,18,19,44,48,49,56]. The highly non-convex objective [18], combined with nondeterminism in training [49] and underspecification [13] of over-parameterized deep networks, can lead trained models to optima at different locations in a manifold, or to different sets of optima.…”
Section: Related Work and Productionalization (mentioning)
confidence: 99%
“…The highly non-convex objective [18], combined with nondeterminism in training [49] and underspecification [13] of over-parameterized deep networks, can lead trained models to optima at different locations in a manifold, or to different sets of optima. Nondeterminism can emerge from the highly parallelized, highly distributed training pipelines, quantization errors, hardware types [56], and more. Slight deviations early in training due to these factors can lead to very different models [1].…”
Section: Related Work and Productionalization (mentioning)
confidence: 99%
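
As a concrete, hedged illustration of taming these tooling-level sources of randomness, the sketch below shows the kind of seeding and determinism flags one would typically set in PyTorch. It is a generic example, not code from the cited papers, and even with these settings, variation across hardware types can remain.

```python
# Generic PyTorch sketch for pinning down framework/GPU nondeterminism.
# Not from the cited papers; even with these settings, differences across
# hardware types or library versions can still change results.
import os
import random

import numpy as np
import torch


def make_deterministic(seed: int = 0) -> None:
    # Seed every RNG the training loop might touch.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

    # Prefer deterministic kernels; ops without a deterministic
    # implementation will raise an error instead of silently varying.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    torch.use_deterministic_algorithms(True)

    # Required for deterministic cuBLAS matmuls (CUDA >= 10.2); should be
    # set before the CUDA context is created.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"


make_deterministic(seed=42)
```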
“…In most experiments, there is inherent randomness in the scores obtained from different runs. This randomness can arise from stochasticity in the task, exploratory choices made during learning, randomized initial parameters, but also software and hardware considerations such as non-determinism in GPUs and in machine learning frameworks [113]. Thus, we model the algorithm's normalized score on the m-th task as a real-valued random variable X_m.…”
Section: Formalism (mentioning)
confidence: 99%
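
To make that formalism concrete, here is a minimal sketch (with made-up scores) of treating the per-task score as a random variable X_m: repeat the run under different seeds and summarize the empirical distribution, for example with a bootstrap confidence interval on the mean.

```python
# Minimal sketch: the normalized scores from repeated runs of one task are
# treated as draws of X_m. The numbers are placeholders, not real results.
import numpy as np

rng = np.random.default_rng(0)
scores = np.array([0.71, 0.74, 0.69, 0.73, 0.75, 0.70, 0.72, 0.68, 0.74, 0.71])

# Bootstrap the mean of X_m to quantify run-to-run uncertainty.
boot_means = np.array([
    rng.choice(scores, size=scores.size, replace=True).mean()
    for _ in range(10_000)
])
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"mean score {scores.mean():.3f}, 95% bootstrap CI [{lo:.3f}, {hi:.3f}]")
```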