Negative future thinking pervades emotional disorders. This hybrid efficacy-effectiveness trial tested a four-session, scalable online cognitive bias modification program for training more positive episodic prediction. 958 adults (73.3% female, 86.5% White, 83.4% from United States) were randomized to positive conditions with ambiguous future scenarios that ended positively, 50/50 conditions that ended positively or negatively, or a control condition with neutral scenarios. As hypothesized (preregistration: https://osf.io/jrst6), positive training participants improved in negative and positive expectancy bias, self-efficacy, and optimism more than control participants, ds and 97.5% CIs = -0.57 [-0.87, -0.27], 0.79 [0.42, 1.15], 0.28 [0.02, 0.53], 0.28 [0.04, 0.51], and, for expectancy bias, more than 50/50 participants, with gains maintained at 1-month follow-up. Unexpectedly, participants across conditions improved comparably in anxiety and depression symptoms and growth mindset. Targeting a transdiagnostic process with a scalable program may improve bias and outlook; however, further validation of outcome measures is required.
This paper studies how to schedule hyperparameters to improve the generalization of both centralized single-machine stochastic gradient descent (SGD) and distributed asynchronous SGD (ASGD). SGD augmented with momentum variants (e.g., heavy-ball momentum (SHB) and Nesterov's accelerated gradient (NAG)) has been the default optimizer for many tasks, in both centralized and distributed environments. However, many advanced momentum variants, despite their empirical advantage over classical SHB/NAG, introduce extra hyperparameters to tune, and this error-prone tuning is a main barrier to AutoML.
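For reference, the classical heavy-ball and Nesterov updates can be written in the following common form, with learning rate α and momentum factor β (standard textbook notation, not equations quoted from this paper):

```latex
% Heavy-ball momentum (SHB): one extra hyperparameter \beta
v_{t+1} = \beta v_t + \nabla f(\theta_t), \qquad \theta_{t+1} = \theta_t - \alpha\, v_{t+1}
% Nesterov's accelerated gradient (NAG): gradient taken at the look-ahead point
v_{t+1} = \beta v_t + \nabla f(\theta_t - \alpha \beta\, v_t), \qquad \theta_{t+1} = \theta_t - \alpha\, v_{t+1}
```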
Centralized SGD: We first focus on centralized single-machine SGD and show how to efficiently schedule the hyperparameters of a large class of momentum variants to improve generalization. We propose a unified framework called multistage quasi-hyperbolic momentum (Multistage QHM), which covers a large family of momentum variants as special cases (e.g., vanilla SGD, SHB, and NAG). Existing works mainly focus on scheduling only the decay of the learning rate α, whereas multistage QHM additionally allows other hyperparameters (e.g., the momentum factor) to vary across stages and demonstrates better generalization than tuning α alone. We also prove the convergence of multistage QHM for general nonconvex objectives.
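As a rough illustration of what a stage-wise schedule of the QHM hyperparameters looks like, the sketch below uses the standard QHM recursion with piecewise-constant hyperparameters; the function names and stage values are placeholders chosen for illustration, not the schedules derived in the paper.

```python
import numpy as np

def multistage_qhm(grad_fn, theta0, stages, seed=0):
    """Minimal multistage QHM sketch (illustrative only).

    stages: list of (num_steps, alpha, beta, nu) tuples; hyperparameters are
    held constant within a stage and switched between stages.
    QHM recursion:
        g_{t+1}     = (1 - beta) * grad + beta * g_t
        theta_{t+1} = theta_t - alpha * ((1 - nu) * grad + nu * g_{t+1})
    Special cases: nu = 0 gives plain SGD; nu = 1 gives (damped) heavy-ball
    momentum; nu = beta recovers Nesterov's accelerated gradient.
    """
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    g = np.zeros_like(theta)                      # momentum buffer
    for num_steps, alpha, beta, nu in stages:
        for _ in range(num_steps):
            grad = grad_fn(theta, rng)            # stochastic gradient
            g = (1.0 - beta) * grad + beta * g    # momentum update
            theta = theta - alpha * ((1.0 - nu) * grad + nu * g)
    return theta

# Toy usage: noisy quadratic, three stages with decaying alpha and
# increasing beta (placeholder schedule for illustration only).
noisy_quad_grad = lambda th, rng: th + 0.1 * rng.standard_normal(th.shape)
stages = [(200, 0.1, 0.9, 0.7), (200, 0.03, 0.95, 0.7), (200, 0.01, 0.99, 0.7)]
print(multistage_qhm(noisy_quad_grad, np.ones(5), stages))
```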
Distributed SGD: We then extend our theory to distributed asynchronous SGD (ASGD), in which a parameter server distributes data batches to several worker machines and updates the parameters by aggregating batch gradients from the workers. We quantify the asynchrony between workers (i.e., gradient staleness), model the dynamics of asynchronous iterations via a stochastic differential equation (SDE), and derive a PAC-Bayesian generalization bound for ASGD. As a byproduct, we show how a moderately large learning rate helps ASGD generalize better.
Our tuning strategies have rigorous justifications rather than relying on blind trial and error: we theoretically prove why they decrease the derived generalization errors in both settings. Empirically, our strategies simplify the tuning process and beat competitive optimizers in test accuracy. Our code is publicly available at https://github.com/jsycsjh/centralized-asynchronous-tuning.