Transfer of recent advances in deep reinforcement learning to real-world
applications is hindered by high data demands and thus low efficiency and
scalability.
Through independent improvements to components such as replay buffers and
more stable learning algorithms, and through massively distributed systems,
training times for standard benchmark tasks have been reduced from several
days to several hours.
However, while rewards in simulated environments are well-defined and easy
to compute, reward evaluation becomes the bottleneck in many real-world
environments, e.g., in molecular optimization tasks, where computationally
demanding simulations or even experiments are required to evaluate
states and to quantify rewards.
When ground-truth evaluations become orders of magnitude more expensive than
in research scenarios, directly transferring recent advances would demand
massive compute resources just for evaluating rewards rather than for
training the models.
We propose to alleviate this problem by replacing costly ground-truth
rewards with rewards modeled by neural networks, while counteracting the
non-stationarity of state and reward distributions during training with an
active learning component.
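As a minimal illustration of this idea, a surrogate reward model can be trained online, querying the costly ground-truth evaluation only for states where an ensemble of reward models disagrees. The sketch below is not our implementation; the ensemble size, uncertainty threshold, and the toy `expensive_reward` stand-in are all illustrative assumptions.

```python
# Sketch: learned surrogate reward with an active-learning query rule.
# All hyperparameters and the toy ground-truth function are assumptions.
import numpy as np

rng = np.random.default_rng(0)

def expensive_reward(state):
    """Stand-in for a costly simulation or experiment."""
    return np.sin(3.0 * state).sum()

class RewardEnsemble:
    """Ensemble of ridge regressors on random features; their
    disagreement serves as an uncertainty estimate."""
    def __init__(self, n_models=5, n_features=64, dim=4):
        self.W = rng.normal(size=(n_models, n_features, dim))
        self.heads = np.zeros((n_models, n_features))

    def _phi(self, states):
        # Random nonlinear features, one set per ensemble member: (M, N, F).
        return np.tanh(states @ self.W.transpose(0, 2, 1))

    def predict(self, states):
        preds = np.einsum("mnf,mf->mn", self._phi(states), self.heads)
        return preds.mean(axis=0), preds.std(axis=0)  # mean, disagreement

    def fit(self, states, rewards):
        # Ridge regression per member on a bootstrap resample of the labels.
        for m in range(self.heads.shape[0]):
            idx = rng.integers(0, len(states), len(states))
            X = self._phi(states[idx])[m]
            y = rewards[idx]
            A = X.T @ X + 1e-3 * np.eye(X.shape[1])
            self.heads[m] = np.linalg.solve(A, X.T @ y)

model = RewardEnsemble()

# Seed the surrogate with a small initial batch of ground-truth labels.
init = rng.normal(size=(16, 4))
labeled_states = list(init)
labeled_rewards = [expensive_reward(s) for s in init]
model.fit(np.array(labeled_states), np.array(labeled_rewards))

threshold = 0.1  # disagreement above this triggers a ground-truth query

for step in range(200):
    states = rng.normal(size=(32, 4))      # states visited by the policy
    r_hat, sigma = model.predict(states)

    # Active learning: label only states where the ensemble disagrees,
    # which tracks the shifting state distribution during training.
    query = sigma > threshold
    for s in states[query]:
        labeled_states.append(s)
        labeled_rewards.append(expensive_reward(s))
    if query.any():
        model.fit(np.array(labeled_states), np.array(labeled_rewards))

    # r_hat would now be fed to the RL update in place of the true reward.
```

Using ensemble disagreement as an uncertainty proxy concentrates the expensive ground-truth evaluations on states where the surrogate is unreliable, keeping the number of costly queries small even as the state and reward distributions shift.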
We demonstrate that, using our proposed method, agents can be trained in
complex real-world environments orders of magnitude faster than would be
possible with ground-truth rewards.
By enabling the application of reinforcement learning methods to new
domains, we find interesting and non-trivial solutions to real-world
optimization problems in chemistry, materials science, and engineering.
We demonstrate speed-up factors of 50 to 3000 when applying our approach to
challenges in molecular design and airfoil optimization.