“…The network architecture and the other learning hyper-parameters were determined by grid search. The search space was defined as follows: {input history length = [1, 2, 3 hours], LSTM units = [32, 64, 128, 256, 512], hidden dense layers = [1, 2, 3, 4, 5], hidden units in the first dense layer = [512, 256, 128, 64, 32], learning rate = [1e-5, 1e-4, 1e-3], batch size = [32, 64, 128]}. To prevent overfitting, we used early stopping: training was halted once the MSE on the validation dataset stopped improving or began to rise, indicating that the network had started to memorize the training data.…”
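The procedure in the excerpt can be sketched as follows: an exhaustive grid search enumerates every combination of the listed hyper-parameter values, and a patience-based early-stopping rule halts training when validation MSE stops improving. The key names, the `patience` value, and the helper `early_stop_epoch` are illustrative assumptions, not taken from the paper.

```python
from itertools import product

# Hyper-parameter grid from the excerpt; the dictionary keys are
# illustrative names, not identifiers from the original paper.
search_space = {
    "history_hours": [1, 2, 3],
    "lstm_units": [32, 64, 128, 256, 512],
    "hidden_dense_layers": [1, 2, 3, 4, 5],
    "first_dense_units": [512, 256, 128, 64, 32],
    "learning_rate": [1e-5, 1e-4, 1e-3],
    "batch_size": [32, 64, 128],
}

# Exhaustive grid search: every combination of values is one candidate.
configs = [dict(zip(search_space, combo))
           for combo in product(*search_space.values())]
print(len(configs))  # 3 * 5 * 5 * 5 * 3 * 3 = 3375 candidates

def early_stop_epoch(val_mse, patience=1):
    """Epoch at which training halts: stop once validation MSE fails to
    improve for `patience` consecutive epochs (assumed patience of 1)."""
    best, bad = float("inf"), 0
    for epoch, mse in enumerate(val_mse):
        if mse < best:
            best, bad = mse, 0  # new best: reset the patience counter
        else:
            bad += 1
            if bad >= patience:
                return epoch
    return len(val_mse) - 1  # never triggered: train to the last epoch

# Validation MSE improves for three epochs, then degrades at epoch 3.
print(early_stop_epoch([0.9, 0.5, 0.4, 0.41, 0.6]))  # halts at epoch 3
```

In practice, frameworks provide this rule directly (e.g. a Keras `EarlyStopping` callback monitoring `val_loss`); the helper above just makes the stopping criterion explicit.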