“…As a rule of thumb for fine-tuning BERT on downstream tasks, Devlin et al (2019) suggested a minimal hyperparameter tuning strategy relying on a grid search over the following ranges: learning rate ∈ {2e−5, 3e−5, 4e−5, 5e−5}, number of training epochs ∈ {3, 4}, batch size ∈ {16, 32}, and a fixed dropout rate of 0.1. These weakly justified suggestions are followed largely uncritically in the literature (Alsentzer et al, 2019; Beltagy et al, 2019; Sung et al, 2019). Given the relatively small size of our datasets, we use batch sizes ∈ {4, 8, 16, 32}.…”